Understanding Linux Network Internals

Christian Benvenuti

Published by O'Reilly Media, Inc.

Beijing ⋅ Cambridge ⋅ Farnham ⋅ Köln ⋅ Sebastopol ⋅ Tokyo

Preface

Today more than ever before, networking is a hot topic. Any electronic gadget in its latest generation embeds some kind of networking capability. The Internet continues to broaden in its population and opportunities. It should not come as a surprise that a robust, freely available, and feature-rich operating system like Linux is well accepted by many producers of embedded devices. Its networking capabilities make it an optimal operating system for networking devices of any kind. The features it already has are well implemented, and new ones can be added easily. If you are a developer for embedded devices or a student who would like to experiment with Linux, this book will provide you with good fodder.

The performance of a pure software-based product that uses Linux cannot compete with commercial products that can count on the help of specialized hardware. This of course is not a criticism of software; it is a simple recognition of the consequence of the speed difference between dedicated hardware and general-purpose CPUs. However, Linux can definitely compete with low-end commercial products that are entirely software-based. Of course, simple extensions to the Linux kernel allow vendors to use Linux on hybrid systems as well (software and hardware); it is only a matter of writing the necessary device drivers.

Linux is also often used as the operating system of choice for the implementation of university projects and theses. Not all of them make it to the official kernel (not right away, at least). A few do, and others are simply made available online as patches to the official kernel. Isn't it a great satisfaction and reward to see your contribution to the Linux kernel being used by potentially millions of users? There is only one drawback: if your contribution is really appreciated, you may not be able to cope with the numerous emails of thanks or requests for help.

The momentum for Linux has been growing continually over the past years, and apparently it can only keep growing.

I first encountered Linux at the University of Bologna, where I was a grad student in computer science around 10 years ago. What a wonderful piece of software! I could work on my image processing projects at home on an i286/486 computer without having to compete with other students for access to the few Sun stations available at the university labs.

Since then, my marriage to Linux has never seen a gray day. It has even started to displace my fond memories of the glorious C64 generation, when I was first introduced to programming with Assembly language and the various dialects of BASIC. Yes, I belong to the C64 generation, and to some extent I can compare the joy of my first programming experiences with the C64 to my first journeys into the Linux kernel.

When I was first introduced to the beautiful world of networking, I started playing with the tools available on Linux. I also had the fortune to work for a UNESCO center in Italy where I helped develop their networking courses, based entirely on Linux boxes. That gave me access to a good lab equipped with all sorts of network devices and documentation, plus plenty of Linux enthusiasts to learn from and to collaborate with.

Unfortunately for my own peace of mind (but fortunately, I hope, for the reader of this book who benefits from the results), I am the kind of person that likes to understand everything and takes very little for granted. So at UNESCO, I started looking into the kernel code. This not only proved to be a good way to burn in my knowledge, but it also gave me more confidence in making use of user-space configuration tools: whenever a configuration tool did not provide a specific option, I usually knew whether it would be possible to add it or whether it would have required significant changes to the kernel. This kind of study turns into a path without an end: you always want more.

After developing a few tools as extensions to the Linux kernel (some revision of versions 2.0 and 2.2), my love for operating systems and networking led me to the Silicon Valley (Cisco Systems). When you learn a language, be it a human language or a computer programming language, a rule emerges: the more languages you know, the easier it becomes to learn new ones. You can identify each one's strengths and weaknesses, see the reasons behind design compromises, etc. The same applies to operating systems.

When I noticed the lack of good documentation about the networking code of the Linux kernel and the availability of good books for other parts of the kernel, I decided to try filling in the gap—or at least part of it. I hope this book will give you the starting documentation that I would have loved to have had years ago.

I believe that this book, together with O'Reilly's other two kernel books (Understanding the Linux Kernel and Linux Device Drivers), represents a good starting point for anyone willing to learn more about the Linux kernel internals. They complement each other and, when they do not address a given feature, point the reader to external documentation sources (when available).

However, I still suggest you make some coffee, turn on the music, and spend some time on the source code trying to understand how a given feature is implemented. I believe the knowledge you build in this way lasts longer than that built in any other way. Shortcuts are good, but sometimes the long way has its advantages, too.

The Audience for This Book

This book can help those who already have some knowledge of networking and would like to see how the engine of the Internet—that is, the Internet Protocol (IP) and its friends—is implemented on a first-class operating system. However, there is a theoretical introduction for each topic, so newcomers will be able to get up to speed quickly, too. Complex topics are accompanied by enough examples to make them easier to follow.

Linux doesn't just support basic IP; it also has quite a few advanced features. More important, its implementation must be sophisticated enough to play nicely with other kernel features such as symmetric multiprocessing (SMP) and kernel preemption. This makes the networking code of the Linux kernel a very good gym in which to train and keep your networking knowledge in shape.

Moreover, if you are like me and want to learn everything, you will find enough details in this book to keep you satisfied for quite a while.

Background Information

Some knowledge of operating systems would help. The networking code, like any other component of the operating system, must follow both common sense and implicit rules for coexistence with the rest of the kernel, including proper use of locking; fair use of memory and CPU; and an eye toward modularity, code cleanliness, and good performance. Even though I occasionally spend time on those aspects, I refer you to the other two O'Reilly kernel books mentioned earlier for a deeper and more detailed discussion of generic operating system services and design.

Some knowledge of networking, and especially IP, would also help. However, I think the theory overview that precedes each implementation description in this book is sufficient to make the book self-contained for both newcomers and experienced readers.

The theoretical description of the topics covered in the book does not require any programming experience. However, the descriptions of the associated implementations require an intermediate knowledge of the C language. Chapter 1 will go through a series of coding conventions and tricks that are often used in the code, which should help especially those with less experience with C and kernel programming.

Organization of the Material

Some aspects of networking code require as many as seven chapters, while for other aspects one chapter is sufficient. When the topic is complex or big enough to span different chapters, the part of the book devoted to that topic always starts with a concept chapter that covers the theory necessary to understand the implementation, which is described in another chapter. All of the reference and secondary material is usually located in one miscellaneous chapter at the end of the part. No matter how big the topic is, the same scheme is used to organize its presentation.

For each topic, the implementation description includes:

  • The big picture, which shows where the described kernel component falls in the network stack.

  • A brief description of the main data structures and a figure that shows how they relate to each other.

  • A description of which other kernel features the component interfaces with—for example, by means of notification chains or data structure cross-references. The firewall is an example of such a kernel feature, given the numerous hooks it has all over the networking code.

  • Extensive use of flow charts and figures to make it easier to go through the code and extract the logic from big and seemingly complex functions.

The reference material always includes:

  • A detailed description of the most important data structures, field by field

  • A table with a brief description of all functions, macros, and data structures, which you can use as a quick reference

  • A list of the files mentioned in the chapter, with their location in the kernel source tree

  • A description of the interface between the most common user-space tools used to configure the topic of the chapter and the kernel

  • A description of any file in /proc that is exported

The Linux kernel's networking code is not just a moving target, but a fast runner. The book does not cover all of the networking features. New ones are probably being added right now while you are reading. Many new features are driven by the needs of single users or organizations, or as university projects, but they find their way into the official kernel when they're considered useful for a large audience. Besides detailing the implementation of a subset of those features, I try to give you an idea of what the generic implementation of a feature might look like. This will help you greatly in understanding changes to the code and learning how new features are implemented. For example, given any feature, you need to take the following points into consideration:

  • How do you design the data structures and the locking semantics?

  • Is there a need for a user-space configuration tool? If so, is it going to interact with the kernel via an existing system call, an ioctl command, a /proc file, or the Netlink socket?

  • Is there any need for a new notification chain, and is there a need to register to an already existing chain?

  • What is the relationship with the firewall?

  • Is there any need for a cache, a garbage collection mechanism, statistics, etc.?
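To make one of those design questions concrete, here is a sketch of how a user-space tool might frame a Netlink request asking the kernel to dump its interface table. It is written in Python purely for illustration (a real tool such as ip would do this in C); the struct layouts mirror nlmsghdr and ifinfomsg, and the numeric constants are the standard values from the <linux/netlink.h> and <linux/rtnetlink.h> headers.

```python
import struct

# Constants from <linux/netlink.h> and <linux/rtnetlink.h>
RTM_GETLINK   = 18      # "dump the interface table" message type
NLM_F_REQUEST = 0x0001  # this message is a request
NLM_F_DUMP    = 0x0300  # NLM_F_ROOT | NLM_F_MATCH: return the whole table

def build_getlink_request(seq):
    """Frame an RTM_GETLINK dump request: nlmsghdr followed by an ifinfomsg."""
    # struct ifinfomsg: family, padding, device type, ifindex, flags, change mask
    payload = struct.pack('=BBHiII', 0, 0, 0, 0, 0, 0)
    # struct nlmsghdr: total length, type, flags, sequence number, port id
    header = struct.pack('=IHHII',
                         16 + len(payload),           # nlmsg_len
                         RTM_GETLINK,                 # nlmsg_type
                         NLM_F_REQUEST | NLM_F_DUMP,  # nlmsg_flags
                         seq,                         # nlmsg_seq
                         0)                           # nlmsg_pid (kernel fills it in)
    return header + payload

msg = build_getlink_request(seq=1)
print(len(msg), struct.unpack_from('=IHH', msg))
```

On a Linux system this buffer could be written to a socket opened with socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE); the kernel would answer with one RTM_NEWLINK message per registered device. Chapter 3 looks at this channel, along with ioctl and /proc, in more detail.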

Here is the list of topics covered in the book:

Interface between user space and kernel

第 3 章中,您将简要概述网络配置工具用于与内核内的对应工具进行交互的机制。这不会是详细的讨论,但它将帮助您理解内核代码的某些部分。

In Chapter 3, you will get a brief overview of the mechanisms that networking configuration tools use to interact with their counterparts inside the kernel. It will not be a detailed discussion, but it will help you to understand certain parts of the kernel code.

System initialization

Part II describes the initialization of key components of the networking code, and how network devices are registered and initialized.

Interface between device drivers and protocol handlers

Part III offers a detailed description of how ingress (incoming or received) packets are handed by the device drivers to the upper-layer protocols, and vice versa.

Bridging

Part IV describes transparent bridging and the Spanning Tree Protocol, the L2 (Layer two) counterpart of routing at L3 (Layer three).

Internet Protocol Version 4 (IPv4)

Part V describes how packets are received, transmitted, forwarded, and delivered locally at the IPv4 layer.

Interface between IPv4 and the transport layer (L4) protocols

Chapter 20 shows how IPv4 packets addressed to the local host are delivered to the transport layer (L4) protocols (TCP, UDP, etc.).

Internet Control Message Protocol (ICMP)

Chapter 25 describes the implementation of ICMP, the only transport layer (L4) protocol covered in the book.

Neighboring protocols

These find local network addresses, given their IP addresses. Part VI describes both the common infrastructure of the various protocols and the details of the ARP neighboring protocol used by IPv4.

Routing

Part VII, the biggest one of the book, describes the routing cache and tables. Advanced features such as Policy Routing and Multipath are also covered.

What Is Not Covered

For lack of space, I had to select a subset of the Linux networking features to cover. No selection would make everyone happy, but I think I covered the core of the networking code, and with the knowledge you can gain with this book, you will find it easier to study on your own any other networking feature of the kernel.

In this book, I decided to focus on the networking code, from the interface between device drivers and the protocol handlers, up to the interface between the IPv4 and L4 protocols. Instead of covering all of the features with a compromise on quality, I preferred to keep quality as the first goal, and to select the subset of features that would represent the best start for a journey into the kernel networking implementation.

Here is a partial list of the features I could not cover for lack of space:

Internet Protocol Version 6 (IPv6)

Even though I do not cover IPv6 in the book, the description of IPv4 can help you a lot in understanding the IPv6 implementation. The two protocols share naming conventions for functions and often for variables. Their interface to Netfilter is also similar.

IP Security protocol

The kernel provides a generic infrastructure for cryptography along with a collection of both ciphers and digest algorithms. The first interface to the cryptographic layer was synchronous, but the latest improvements are adding an asynchronous interface to allow Linux to take advantage of hardware cards that can offload the work from the CPU.

The protocols of the IPsec suite—Authentication Header (AH), Encapsulating Security Payload (ESP), and IP Compression (IPcomp)—are implemented in the kernel and make use of the cryptographic layer.

IP multicast and IP multicast routing

Multicast functionality was implemented to conform to versions 2 and 3 of the Internet Group Management Protocol (IGMP). Multicast routing support is also present, conforming to versions 1 and 2 of Protocol Independent Multicast (PIM).

Transport layer (L4) protocols

Several L4 protocols are implemented in the Linux kernel. Besides the two well-known ones, UDP and TCP, Linux has the newer Stream Control Transmission Protocol (SCTP). A good description of the implementation of those protocols would require a new book of this size, all on its own.

Traffic Control

This is the Quality of Service (QoS) layer of Linux, another interesting and powerful component of the kernel's networking code. Traffic control is implemented as a general infrastructure and as a collection of traffic classifiers and queuing disciplines. I briefly describe it and the interface it provides to the main transmission routine in Chapter 11. A great deal of documentation is available at http://lartc.org.

Netfilter

The firewall code infrastructure and its extensions (including the various NAT flavors) are not covered in the book, but I describe their interaction with most of the networking features I cover. At the Netfilter home page, http://www.netfilter.org, you can find some interesting documentation about its kernel internals.

Network filesystems

Several network filesystems are implemented in the kernel, among them NFS (versions 2, 3, and 4), SMB, Coda, and Andrew. You can read a detailed description of the Virtual File System layer in Understanding the Linux Kernel, and then delve into the source code to see how those network filesystems interface with it.

Virtual devices

The use of a dedicated virtual device underlies the implementation of several networking features. Examples include 802.1Q, bonding, and the various tunneling protocols, such as IP-over-IP (IPIP) and Generic Routing Encapsulation (GRE). Virtual devices need to follow the same guidelines as real devices and provide the same interface to other kernel components. In different chapters, where needed, I compare real and virtual device behaviors. The only virtual device that is described in detail is the bridge interface, which is covered in Part IV.

DECnet, IPX, AppleTalk, etc.

These have historical roots and are still in use, but are much less commonly used than IP. I left them out to give more space to topics that affect more users.

IP virtual server

This is another interesting piece of the networking code, described at http://www.linuxvirtualserver.org/. This feature can be used to build clusters of servers using different scheduling algorithms.

Simple Network Management Protocol (SNMP)

No chapter in this book is dedicated to SNMP, but for each feature, I give a description of all the counters and statistics kept by the kernel, the routines used to manipulate them, and the /proc files used to export them, when available.
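Those exported counters follow a simple layout: in /proc/net/snmp, each protocol contributes a pair of lines, one naming the fields and one carrying the values. The sketch below (Python, for illustration; the sample text is a shortened, made-up excerpt rather than real output) shows how that pairing can be parsed.

```python
def parse_proc_net_snmp(text):
    """Parse the /proc/net/snmp layout: 'Proto: names' / 'Proto: values' line pairs."""
    counters = {}
    lines = [line for line in text.splitlines() if line.strip()]
    for names_line, values_line in zip(lines[::2], lines[1::2]):
        proto, names = names_line.split(':', 1)
        _, values = values_line.split(':', 1)
        counters[proto] = dict(zip(names.split(), map(int, values.split())))
    return counters

# Shortened sample in the /proc/net/snmp format (the values are made up)
SAMPLE = """\
Ip: Forwarding DefaultTTL InReceives InDelivers
Ip: 1 64 2540 2487
Icmp: InMsgs InErrors OutMsgs
Icmp: 12 0 9
"""

stats = parse_proc_net_snmp(SAMPLE)
print(stats['Ip']['InReceives'], stats['Icmp']['OutMsgs'])
```

On a real system, replacing SAMPLE with the contents of /proc/net/snmp yields the live counters that tools such as netstat -s summarize.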

Frame Diverter

This feature allows the kernel to kidnap ingress frames not addressed to the local host. I will briefly mention it in Part III. Its home page is http://diverter.sourceforge.net.

Plenty of other network projects are available as separate patches to the kernel, and I can't list them all here. One that I find particularly fascinating and promising, especially in relation to the Linux routing code, is the highly configurable Click router, currently offered at http://pdos.csail.mit.edu/click/.

Because this is a book about the kernel, I do not cover user-space configuration tools. However, for each topic, I describe the interface between the most common user-space configuration tools and the kernel.

Conventions Used in This Book

The following is a list of the typographical conventions used in this book:

Italic

Used for file and directory names, program and command names, command-line options, URLs, and new terms

Constant Width

Used in examples to show the contents of files or the output from commands, and in the text to indicate words that appear in C code or other literal strings

Constant Width Italic

Used to indicate text within commands that the user replaces with an actual value

Constant Width Bold

Used in examples to show commands or other text that should be typed literally by the user

Pay special attention to notes set apart from the text with the following icons:

Tip

This is a tip. It contains useful supplementary information about the topic at hand.

Warning

This is a warning. It helps you solve and avoid annoying problems.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. The code samples are covered by a dual BSD/GPL license.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "Understanding Linux Network Internals, by Christian Benvenuti. Copyright 2006 O'Reilly Media, Inc., 0-596-00255-6."

We'd Like to Hear from You

Please address comments and questions concerning this book to the publisher:

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
(800) 998-9938 (in the United States or Canada)
(707) 829-0515 (international or local)
(707) 829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at:

http://www.oreilly.com/catalog/understandlni/

To comment or ask technical questions about this book, send email to:

For more information about our books, conferences, Resource Centers, and the O'Reilly Network, see our web site at:

http://www.oreilly.com

Safari Enabled

When you see a Safari® Enabled icon on the cover of your favorite technology book, that means the book is available online through the O'Reilly Network Safari Bookshelf.

Safari offers a solution that's better than e-books. It's a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://safari.oreilly.com.

Acknowledgments

This book would not have been possible without an interesting topic to talk about, and an audience. The interesting topic is Linux, this modern operating system that anyone has an opportunity to be part of, and the audience is the incredible number of users that often decide not only to take advantage of the good work of others, but also to contribute to its success by getting involved in its development. I have always loved sharing knowledge and passion for the things I like, and with this book, I have tried my best to add a lane or two to the highway that takes interested people into the wonderful world of the Linux kernel.

Of course, I did not do everything while lying in a hammock by the beach, with an ice cream in one hand and a mouse in the other. It took quite a lot of work to investigate the reasons behind some of the implementation choices. It is incredible how much information you can dig out of the development mailing lists, and how much people are willing to share their knowledge when you show genuine interest in their work.

For sure, this book would not be what it is without the great help and suggestions of my editor, Andy Oram. Due to the frequent changes that the networking code experiences, a few chapters had to undergo substantial updates during the writing of the book, but Andy understood this and helped me get to the finish line.

I also would like to thank all of those people that supported me in this effort, and Cisco Systems for giving me the flexibility I needed to work on this book.

A special thanks also goes to the technical reviewers for being able to review a book of this size in a short amount of time, still providing useful comments that allowed me to catch errors and improve the quality of the material. The book was reviewed by Jerry Cooperstein, Michael Boerner, and Paul Kinzelman (in alphabetical order, by first name). I also would like to thank Francois Tallet for reviewing Part IV and Andi Kleen for his feedback on Part V.

Part I. General Background

The information in this part of the book represents the basic knowledge you need to understand the rest of the book comfortably. If you are already familiar with the Linux kernel, or you are an experienced software engineer, you will be able to go pretty quickly through these chapters. For other readers, I suggest getting familiar with this material before proceeding with the following parts of the book:

Chapter 1 Introduction

The bulk of this chapter is devoted to introducing a few of the common programming patterns and tricks that you'll often meet in the networking code.

Chapter 2 Critical Data Structures

In this chapter, you can find a detailed description of two of the most important data structures used by the networking code: the socket buffer sk_buff and the network device net_device.
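As a taste of what that chapter covers, the following toy model mimics the pointer arithmetic behind three classic buffer helpers, skb_reserve, skb_put, and skb_push. It is a deliberately simplified sketch in Python, not the real struct sk_buff (which is a large C structure with many more fields); integer offsets into a flat buffer stand in for the head, data, tail, and end pointers.

```python
class SkBuff:
    """Toy model of sk_buff buffer management: offsets stand in for pointers."""
    def __init__(self, size):
        self.head, self.data, self.tail, self.end = 0, 0, 0, size
        self.len = 0

    def skb_reserve(self, n):
        # Leave headroom so lower layers can later prepend their headers.
        self.data += n
        self.tail += n

    def skb_put(self, n):
        # Append n bytes of payload at the tail.
        assert self.tail + n <= self.end, "no tailroom left"
        self.tail += n
        self.len += n

    def skb_push(self, n):
        # Prepend an n-byte header in front of the current data.
        assert self.data - n >= self.head, "no headroom left"
        self.data -= n
        self.len += n

skb = SkBuff(2048)
skb.skb_reserve(64)   # reserve headroom before filling in any data
skb.skb_put(100)      # payload appended by an upper layer
skb.skb_push(20)      # a lower-layer header prepended in front of it
print(skb.data, skb.tail, skb.len)
```

The ordering shown here (reserve headroom first, append payload, then push headers as the packet descends the stack) is exactly the discipline the real helpers enforce on the kernel's buffers.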

Chapter 3 User-Space-to-Kernel Interface

The discussion of each feature in this book ends with a set of sections that shows how user-space configuration tools and the kernel communicate. The information in this chapter can help you understand those sections better.

Chapter 1. Introduction

To do research in the source code of a large project is to enter a strange, new land with its own customs and unspoken expectations. It is useful to learn some of the major conventions up front, and to try interacting with the inhabitants instead of merely standing back and observing.

The bulk of this chapter is devoted to introducing you to a few of the common programming patterns and tricks that you'll often meet in the networking code.

I encourage you, when possible, to try interacting with a given part of the kernel networking code by means of user-space tools. So in this chapter, I'll give you a few pointers as to where you can download those tools if they're not already installed on your preferred Linux distribution, or if you simply want to upgrade them to the latest versions.

I'll also describe some tools that let you find your way gracefully through the enormous kernel code. Finally, I'll explain briefly why a kernel feature may not be integrated into the official kernel releases, even if it is widely used in the Linux community.

Basic Terminology

In this section, I'll introduce terms and abbreviations that are going to be used extensively in this book.

Eight-bit quantities are normally called octets in the networking literature. In this book, however, I use the more familiar term byte. After all, the book describes the behavior of the kernel rather than some network abstraction, and kernel developers are used to thinking in terms of bytes.

The terms vector and array will be used interchangeably.

When referring to the layers of the TCP/IP network stack, I will use the abbreviations L2, L3, and L4 to refer to the link, network, and transport layers, respectively. The numbers are based on the famous (if not exactly current) seven-layer OSI model. In most cases, L2 will be a synonym for Ethernet, L3 for IP Version 4 or 6, and L4 for UDP, TCP, or ICMP. When I need to refer to a specific protocol, I'll use its name (i.e., TCP) rather than the generic Ln protocol term.

In different chapters, we will see how data units are received and transmitted by the protocols that sit at a given layer in the network stack. In those contexts, the terms ingress and input will be used interchangeably. The same applies to egress and output. The action of receiving or transmitting a data unit may be referred to with the abbreviations RX and TX, respectively.

A data unit is given different names, such as frame, packet, segment, and message, depending on the layer where it is used (see Chapter 13 for more details). Table 1-1 summarizes the major abbreviations you'll see in the book.

Table 1-1. Abbreviations used frequently in this book

Abbreviation    Meaning
L2              Link layer (e.g., Ethernet)
L3              Network layer (e.g., IP)
L4              Transport layer (e.g., UDP/TCP/ICMP)
BH              Bottom half
IRQ             Interrupt
RX              Reception
TX              Transmission

Common Coding Patterns

Each networking feature, like any other kernel feature, is just one of the citizens inside the kernel. As such, it must make proper and fair use of memory, CPU, and all other shared resources. Most features are not written as standalone pieces of kernel code, but interact with other kernel components more or less heavily depending on the feature. They therefore try, as much as possible, to follow similar mechanisms to implement similar functionalities (there is no need to reinvent the wheel every time).

Some requirements are common to several kernel components, such as the need to allocate several instances of the same data structure type, the need to keep track of references to an instance of a data structure to avoid unsafe memory deallocations, etc. In the following subsections, we will view common ways in Linux to handle such requirements. I will also talk about common coding tricks that you may come across while browsing the kernel's code.

This book uses subsystem as a loose term to describe a collection of files that implement a major set of features—such as IP or routing—and that tend to be maintained by the same people and to change in lockstep. In the rest of the chapter, I'll also use the term kernel component to refer to these subsystems, because the conventions discussed here apply to most parts of the kernel, not just those involved in networking.

Memory Caches

The kernel uses the kmalloc and kfree functions to allocate and free a memory block, respectively. The syntax of those two functions is similar to that of the two sister calls, malloc and free, from the libc user-space library. For more details on kmalloc and kfree, please refer to Linux Device Drivers (O'Reilly).

It is common for a kernel component to allocate several instances of the same data structure type. When allocation and deallocation are expected to happen often, the associated kernel component initialization routine (for example, fib_hash_init for the routing table) usually allocates a special memory cache that will be used for the allocations. When a memory block is freed, it is actually returned to the same cache from which it was allocated.

Some examples of network data structures for which the kernel maintains dedicated memory caches include:

Socket buffer descriptors

This cache, allocated by skb_init in net/core/sk_buff.c, is used for the allocation of sk_buff buffer descriptors. The sk_buff structure is probably the one that registers the highest number of allocations and deallocations in the networking subsystem.

Neighboring protocol mappings

Each neighboring protocol uses a memory cache to allocate the data structures that store L3-to-L2 address mappings. See Chapter 27.

Routing tables

The routing code uses two memory caches for two of the data structures that define routes. See Chapter 32.

Here are the key kernel functions used to deal with memory caches:

kmem_cache_create

kmem_cache_destroy

Create and destroy a cache.

kmem_cache_alloc

kmem_cache_free

Allocate and return a buffer to the cache. They are usually called via wrappers, which manage the requests for allocation and deallocation at a higher level. For example, the request to free an instance of an sk_buff buffer with kfree_skb ends up calling kmem_cache_free only when all the references to the buffer have been released and all the necessary cleanup has been done by the interested subsystems (for instance, the firewall).

The limit on the number of instances that can be allocated from a given cache (when present) is usually enforced by the wrappers around kmem_cache_alloc, and is sometimes configurable with a parameter in /proc.

For more details on how memory caches are implemented and how they interface to the slab allocator, please refer to Understanding the Linux Kernel (O'Reilly).
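To make the pattern concrete, here is a minimal user-space analogue of a memory cache: a pool of fixed-size objects backed by a free list, so that freed objects are returned to the cache and reused by later allocations. The names (cache_create, cache_alloc, cache_free) and the free-list design are illustrative only; they stand in for the kernel's kmem_cache_* API, which is backed by the slab allocator rather than malloc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* User-space analogue of the kernel's kmem_cache pattern: a cache of
 * fixed-size objects backed by a free list. The kernel equivalents are
 * kmem_cache_create/destroy and kmem_cache_alloc/free; the names and
 * implementation here are illustrative only. */
struct obj_cache {
    size_t obj_size;
    void  *free_list;   /* singly linked list threaded through free objects */
};

static struct obj_cache *cache_create(size_t obj_size)
{
    struct obj_cache *c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    /* Each free object must be able to hold the "next" pointer. */
    c->obj_size = obj_size < sizeof(void *) ? sizeof(void *) : obj_size;
    c->free_list = NULL;
    return c;
}

static void *cache_alloc(struct obj_cache *c)
{
    if (c->free_list) {               /* reuse a previously freed object */
        void *obj = c->free_list;
        memcpy(&c->free_list, obj, sizeof(void *));
        return obj;
    }
    return malloc(c->obj_size);       /* cache empty: fall back to the heap */
}

static void cache_free(struct obj_cache *c, void *obj)
{
    /* Return the object to the cache instead of the general allocator. */
    memcpy(obj, &c->free_list, sizeof(void *));
    c->free_list = obj;
}
```

The point of the design is the same as in the kernel: when a subsystem allocates and frees many instances of one structure type, recycling them through a dedicated cache is cheaper than going to the general-purpose allocator every time.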

Caching and Hash Tables

It is pretty common to use a cache to increase performance. In the networking code, there are caches for L3-to-L2 mappings (such as the ARP cache used by IPv4), for the routing table cache, etc.

Cache lookup routines often take an input parameter that says whether a cache miss should or should not create a new element and add it to the cache. Other lookup routines simply add missing elements all the time.

Caches are often implemented with hash tables. The kernel provides a set of data types, such as one-way and bidirectional lists, that can be used as building blocks for simple hash tables.

The standard way to handle inputs that hash to the same value is to put them in a list. Traversing this list takes substantially longer than using the hash key to do a lookup. Therefore, it is always important to minimize the number of inputs that hash to the same value.
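A tiny chained hash table shows the structure just described: each bucket heads a collision list, and a lookup first hashes the key and then walks the list for that bucket. Everything here (names, bucket count, integer keys) is invented for illustration; the kernel builds the same shape out of its own list primitives.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal chained hash table: inputs that hash to the same bucket are
 * linked into a collision list. All names and sizes are illustrative. */
#define NBUCKETS 8

struct entry {
    int key;
    int value;
    struct entry *next;   /* collision list within one bucket */
};

static struct entry *buckets[NBUCKETS];

static unsigned int hash(int key)
{
    return (unsigned int)key % NBUCKETS;
}

static void table_insert(int key, int value)
{
    struct entry *e = malloc(sizeof(*e));
    e->key = key;
    e->value = value;
    e->next = buckets[hash(key)];     /* push onto the bucket's list */
    buckets[hash(key)] = e;
}

static struct entry *table_lookup(int key)
{
    /* Walking the collision list is the slow part: this loop is what
     * grows when too many inputs hash to the same value. */
    struct entry *e;
    for (e = buckets[hash(key)]; e; e = e->next)
        if (e->key == key)
            return e;
    return NULL;
}
```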

When the lookup time on a hash table (whether it uses a cache or not) is a critical parameter for the owner subsystem, it may implement a mechanism to increase the size of the hash table so that the average length of the collision lists goes down and the average lookup time improves. See the section "Dynamic resizing of per-netmask hash tables" in Chapter 34 for an example.

You may also find subsystems, such as the neighboring layer, that add a random component (regularly changed) to the key used to distribute elements in the cache's buckets. This is used to reduce the damage of Denial of Service (DoS) attacks aimed at concentrating the elements of a hash table into a single bucket. See the section "Caching" in Chapter 27 for an example.

Reference Counts

When a piece of code tries to access a data structure that has already been freed, the kernel is not very happy, and the user is rarely happy with the kernel's reaction. To avoid those nasty problems, and to make garbage collection mechanisms easier and more effective (see the section "Garbage Collection" later in this chapter), most data structures keep a reference count. Good kernel citizens increment and decrement the reference count of every data structure every time they save and release a reference, respectively, to the structure. For any data structure type that requires a reference count, the kernel component that owns the structure usually exports two functions that can be used to increment and decrement the reference count. Such functions are usually called xxx _hold and xxx _release, respectively. Sometimes the release function is called xxx _put instead (e.g., dev_put for net_device structures).

While we like to assume there are no bad citizens in the kernel, developers are human, and as such they do not always write bug-free code. The use of the reference count is a simple but effective mechanism to avoid freeing still-referenced data structures. However, it does not always solve the problem completely. This is the consequence of forgetting to balance increments and decrements:

  • If you release a reference to a data structure but forget to call the xxx _release function, the kernel will never allow the data structure to be freed (unless another buggy piece of code happens to call the release function an extra time by mistake!). This leads to gradual memory exhaustion.

  • If you take a reference to a data structure but forget to call xxx _hold, and at some later point you happen to be the only reference holder, the structure will be prematurely freed because you are not accounted for. This case definitely can be more catastrophic than the previous one; your next attempt to access the structure can corrupt other data or cause a kernel panic that brings down the whole system instantly.

When a data structure is to be removed for some reason, the reference holders can be explicitly notified about its going away so that they can politely release their references. This is done through notification chains. See the section "Reference Counts" in Chapter 8 for an interesting example.

The reference count on a data structure typically can be incremented when:

  • There is a close relationship between two data structure types. In this case, one of the two often maintains a pointer initialized to the address of the second one.

  • A timer is started whose handler is going to access the data structure. When the timer is fired, the reference count on the structure is incremented, because the last thing you want is for the data structure to be freed before the timer expires.

  • A successful lookup on a list or a hash table returns a pointer to the matching element. In most cases, the returned result is used by the caller to carry out some task. Because of that, it is common for a lookup routine to increase the reference count on the matching element, and let the caller release it when necessary.

When the last reference to a data structure is released, it may be freed because it is not needed anymore, but not necessarily.
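The xxx_hold/xxx_put convention can be sketched in a few lines. In the kernel the counter would be an atomic_t manipulated with atomic operations; this user-space sketch uses a plain int and an invented `struct widget` purely to show the discipline: the creator holds the first reference, every additional user takes one with hold, and only the release of the last reference frees the structure.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the xxx_hold/xxx_put reference-counting convention.
 * The widget type, the plain-int counter, and the `freed` flag are
 * illustrative; kernel code would use atomic operations. */
struct widget {
    int refcnt;
};

static int freed;   /* lets the example observe the deallocation */

static struct widget *widget_alloc(void)
{
    struct widget *w = malloc(sizeof(*w));
    w->refcnt = 1;    /* the creator holds the first reference */
    return w;
}

static void widget_hold(struct widget *w)
{
    w->refcnt++;      /* a new user saves a reference */
}

static void widget_put(struct widget *w)
{
    /* Only the release of the last reference frees the structure. */
    if (--w->refcnt == 0) {
        free(w);
        freed = 1;
    }
}
```

Forgetting a widget_put leaks the structure forever; forgetting a widget_hold lets someone else's put free it under you, which is the catastrophic case described above.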

The introduction of the new sysfs filesystem has helped to make a good portion of the kernel code more aware of reference counts and consistent in its use of them.

Garbage Collection

Memory is a shared and limited resource and should not be wasted, particularly in the kernel because it does not use virtual memory. Most kernel subsystems implement some sort of garbage collection to reclaim the memory held by unused or stale data structure instances. Depending on the needs of any given feature, you will find two main kinds of garbage collection:

Asynchronous

This type of garbage collection is unrelated to particular events. A timer that expires regularly invokes a routine that scans a set of data structures and frees the ones considered eligible for deletion. The conditions that make a data structure eligible for deletion depend on the features and logic of the subsystem, but a common criterion is the presence of a null reference count.

Synchronous

There are cases where a shortage of memory, which cannot wait for the asynchronous garbage collection timer to kick in, triggers immediate garbage collection. The criteria used to select the data structures eligible for deletion are not necessarily the same ones used by asynchronous cleanup (for instance, they could be more aggressive). See Chapter 33 for an example.

第 7 章中,您将看到内核如何设法回收初始化例程使用的内存,并且在执行初始化例程后不再需要这些内存。

In Chapter 7, you will see how the kernel manages to reclaim the memory used by initialization routines and that is no longer needed after they have been executed.

Function Pointers and Virtual Function Tables (VFTs)

Function pointers are a convenient way to write clean C code while getting some of the benefits of the object-oriented languages. In the definition of a data structure type (the object), you include a set of function pointers (the methods). Some or all manipulations of the structure are then done through the embedded functions. C-language function pointers in data structures look like this:

struct sock {
    ...
    void    (*sk_state_change)(struct sock *sk);
    void    (*sk_data_ready)(struct sock *sk, int bytes);
    ...

};

A key advantage to using function pointers is that they can be initialized differently depending on various criteria and the role played by the object. Thus, invoking sk_state_change may actually invoke different functions for different sock sockets.
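The following sketch shows that mechanism in miniature: two instances of the same structure type carry different handlers behind the same pointer, so the same call site runs different code per object. The sock fields and handler bodies are simplified stand-ins, not the real kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Different instances of the same structure can point their "methods"
 * at different routines, which is how invoking sk_state_change can run
 * different code for different sockets. These fields and handlers are
 * simplified stand-ins for the kernel's struct sock. */
struct sock {
    int state;
    void (*sk_state_change)(struct sock *sk);
};

static void state_change_default(struct sock *sk) { sk->state = 1; }
static void state_change_listen(struct sock *sk)  { sk->state = 2; }

static void notify_state_change(struct sock *sk)
{
    /* Callers go through the pointer, so they need not know which
     * handler a particular socket was given at initialization. */
    if (sk->sk_state_change)
        sk->sk_state_change(sk);
}
```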

Function pointers are used extensively in the networking code. The following are only a few examples:

  • When an ingress or egress packet is processed by the routing subsystem, it initializes two routines in the buffer data structure. You will see this in Chapter 35. Refer to Chapter 2 for a complete list of function pointers included in the sk_buff data structure.

  • When a packet is ready for transmission on the networking hardware, it is handed to the hard_start_xmit function pointer of the net_device data structure. That routine is initialized by the device driver associated with the device.

  • When an L3 protocol wants to transmit a packet, it invokes one of a set of function pointers. These have been initialized to a set of routines by the address resolution protocol associated with the L3 protocol. Depending on the actual routine to which the function pointer is initialized, a transparent L3-to-L2 address resolution may take place (for example, IPv4 packets go through ARP). When the address resolution is unnecessary, a different routine is used. See Part VI for a detailed discussion on this interface.

We see in the preceding examples how function pointers can be employed as interfaces between kernel components or as generic mechanisms to invoke the right function handler at the right time based on the result of something done by a different subsystem. There are cases where function pointers are also used as a simple way to allow protocols, device drivers, or any other feature to personalize an action.

Let's look at an example. When a device driver registers a network device with the kernel, it goes through a series of steps that are needed regardless of the device type. At some point, it invokes a function pointer on the net_device data structure to let the device driver do something extra if needed. The device driver could either initialize that function pointer to a function of its own, or leave the pointer NULL because the default steps performed by the kernel are sufficient.

A check on the value of a function pointer is always necessary before executing it to avoid NULL pointer dereferences, as shown in this snapshot from register_netdevice:

    if (dev->init && dev->init(dev) != 0) {
        ...
    }

Function pointers have one main drawback: they make browsing the source code a little harder. While going through a given code path, you may end up focusing on a function pointer call. In such cases, before proceeding down the code path, you need to find out how the function pointer has been initialized. It could depend on different factors:

  • When the selection of the routine to assign to a function pointer is based on a particular piece of data, such as the protocol handling the data or the device driver a given packet is received from, it is easier to derive the routine. For example, if a given device is managed by the drivers/net/3c59x.c device driver, you can derive the routine to which a given function pointer of the net_device data structure is initialized by reading the device initialization routine provided by the device driver.

  • When the selection of the routine is based instead on more complex logic, such as the state of the resolution of an L3-to-L2 address mapping, the routine used at any time depends on external events that cannot be predicted.

A set of function pointers grouped into a data structure are often referred to as a virtual function table (VFT). When a VFT is used as the interface between two major subsystems, such as the L3 and L4 protocol layers, or when the VFT is simply exported as an interface to a generic kernel component (set of objects), the number of function pointers in it may swell to include many different pointers that accommodate a wide range of protocols or other features. Each feature may end up using only a few of the many functions provided. You will see an example in Part VI. Of course, if this use of a VFT is taken too far, it becomes cumbersome and a major redesign is needed.

goto Statements

Few C programmers like the goto statement. Without getting into the history of the goto (one of the longest and most famous controversies in computer programming), I'll summarize some of the reasons the goto is usually deprecated, but why the Linux kernel uses it anyway.

Any piece of code that uses goto can be rewritten without it. The use of goto statements can reduce the readability of the code, and make debugging harder, because at any position following a goto you can no longer derive unequivocally the conditions that led the execution to that point.

Let me make this analogy: given any node in a tree, you know what the path from the root to the node is. But if you add vines that entwine around branches randomly, you do not always have a unique path between the root and the other nodes anymore.

However, because the C language does not provide explicit exceptions (and they are often avoided in other languages as well because of the performance hit and coding complexity), carefully placed goto statements can make it easier to jump to code that handles undesired or peculiar events. In kernel programming, and particularly in networking, such events are very common, so goto becomes a convenient tool.
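The idiom the kernel favors looks like this: acquire resources in order, and on failure jump to a label that unwinds only what was already acquired, so the cleanup code exists exactly once. The resources and the failure knob below are invented for the example; the label-per-resource structure is the part that mirrors kernel practice.

```c
#include <assert.h>
#include <stdlib.h>

/* The kernel's typical use of goto: unwind partially acquired
 * resources through shared exit labels instead of duplicating the
 * cleanup in every error branch. The two allocations and the
 * fail_second knob are invented for this example. */
static int fail_second;   /* force the second acquisition to fail */

static int setup(void)
{
    char *first = NULL, *second = NULL;
    int err = -1;

    first = malloc(16);
    if (!first)
        goto out;                 /* nothing to undo yet */

    second = fail_second ? NULL : malloc(16);
    if (!second)
        goto free_first;          /* undo only what was already done */

    /* ... normal work would happen here ... */
    free(second);
    free(first);
    return 0;

free_first:
    free(first);
out:
    return err;
}
```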

I must defend the kernel's use of goto by pointing out that developers have by no means gone wild with it. Even though there are more than 30,000 instances, they are mainly used to handle different return codes within a function, or to jump out of more than one level of nesting.

Vector Definitions

In some cases, the definition of a data structure includes an optional block at the end. This is an example:

struct abc {
    int age;
    char *name[20];
    ...
    char    placeholder[0];
}

The optional block starts with placeholder. Note that placeholder is defined as a vector of size 0. This means that when abc is allocated with the optional block, placeholder points to the beginning of the block. When no optional block is required, placeholder is just a pointer to the end of the structure; it does not consume any space.

Thus, if abc is used by several pieces of code, each one can use the same basic definition (avoiding the confusion of doing the same thing in slightly different ways) while extending abc differently to personalize its definition according to its needs.
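Allocation of such a structure works by requesting sizeof(struct) plus however many bytes the optional block needs; the trailing member then names the start of that extra space. The sketch below uses the C99 flexible-array spelling (`char data[];`), equivalent in practice to the GNU `char placeholder[0]` shown above; the `struct message` type and its fields are made up for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* How a structure ending in a zero-length (or C99 flexible) array is
 * allocated: ask for sizeof(struct) plus the optional block, and the
 * trailing member then points at that extra space. `struct message`
 * is an invented stand-in for `struct abc`. */
struct message {
    int  len;
    char data[];   /* occupies no space in sizeof(struct message) */
};

static struct message *message_alloc(const char *payload, int len)
{
    /* One allocation covers both the fixed part and the optional block. */
    struct message *m = malloc(sizeof(*m) + len);
    if (!m)
        return NULL;
    m->len = len;
    memcpy(m->data, payload, len);   /* fill the optional block */
    return m;
}
```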

We will see this kind of data structure definition a few times in the book. One example is in Chapter 19.

Conditional Directives (#ifdef and family)

Conditional directives to the compiler are sometimes necessary. An excessive use of them can reduce the readability of the code, but I can state that Linux does not abuse them. They appear for different reasons, but the ones we are interested in are those used to check whether a given feature is supported by the kernel. Configuration tools such as make xconfig determine whether the feature is compiled in, not supported at all, or loadable as a module.

Examples of feature checks by #ifdef or #if defined C preprocessor directives are:

To include or exclude fields from a data structure definition
struct sk_buff {
    ...
#ifdef CONFIG_NETFILTER_DEBUG
    unsigned int nf_debug;
#endif
    ...
}

In this example, the Netfilter debugging feature requires an nf_debug field in the sk_buff structure. When the kernel does not have support for Netfilter debugging (a feature needed by only a handful of developers), there is no need to include the field, which would just take up more memory for every network packet.

To include or exclude pieces of code from a function
int ip_route_input(...)
{
    ...
        if (rth->fl.fl4_dst == daddr &&
            rth->fl.fl4_src == saddr &&
            rth->fl.iif == iif &&
            rth->fl.oif == 0 &&
#ifndef CONFIG_IP_ROUTE_FWMARK
            rth->fl.fl4_fwmark == skb->nfmark &&
#endif
            rth->fl.fl4_tos == tos) {
            ...
        }
}

The routing cache lookup routine ip_route_input, described in Chapter 33, checks the value of the tag set by the firewall only when the kernel has been compiled with support for the "IP: use netfilter MARK value as routing key" feature.

To select the right prototype for a function
#ifdef CONFIG_IP_MULTIPLE_TABLES
struct fib_table * fib_hash_init(int id)
#else
struct fib_table * __init fib_hash_init(int id)
{
    ...
}

In this example, the directives are used to add the __init tag[*] to the prototype when the kernel does not have support for Policy Routing.

To select the right definition for a function
#ifndef CONFIG_IP_MULTIPLE_TABLES
...
static inline struct fib_table *fib_get_table(int id)
{
    if (id != RT_TABLE_LOCAL)
        return ip_fib_main_table;
    return ip_fib_local_table;
}
...
#else
...
static inline struct fib_table *fib_get_table(int id)
{
    if (id == 0)
        id = RT_TABLE_MAIN;
    return fib_tables[id];
}
...
#endif

Note that this case differs from the previous one. In the previous case, the function body lies outside the #ifdef/#endif blocks, whereas in this case, each block contains a complete definition of the function.

The definition or initialization of variables and macros can also use conditional compilation.

It is important to know about the existence of multiple definitions of certain functions or macros, whose selection at compile time is based on a preprocessor macro as in the preceding examples. Otherwise, when you look for a function, variable, or macro definition, you may be looking at the wrong one.

See Chapter 7 for a discussion of how the introduction of special macros has reduced, in some cases, the use of conditional compiler directives.

Compile-Time Optimization for Condition Checks

Most of the time, when the kernel compares a variable against some external value to see whether a given condition is met, the result is extremely likely to be predictable. This is pretty common, for example, with code that enforces sanity checks. The kernel uses the likely and unlikely macros, respectively, to wrap comparisons that are likely to return a true (1) or false (0) result. Those macros take advantage of a feature of the gcc compiler that can optimize the compilation of the code based on that information.

Here is an example. Let's suppose you need to call the do_something function, and that in case of failure, you must handle it with the handle_error function:

err = do_something(x,y,z);
if (err)
    handle_error(err);

Under the assumption that do_something rarely fails, you can rewrite the code as follows:

err = do_something(x,y,z);
if (unlikely(err))
    handle_error(err);

An example of the optimization made possible by the likely and unlikely macros is in handling options in the IP header. The use of IP options is limited to very specific cases, and the kernel can safely assume that most IP packets do not carry IP options. When the kernel forwards an IP packet, it needs to take care of options according to the rules described in Chapter 18. The last stage of forwarding an IP packet is taken care of by ip_forward_finish. This function uses the unlikely macro to wrap the condition that checks whether there is any IP option to take care of. See the section "ip_forward_finish Function" in Chapter 20.

Mutual Exclusion

Locking is used extensively in the networking code, and you are likely to see it come up as an issue under every topic in this book. Mutual exclusion, locking mechanisms, and synchronization are a general topic—and a highly interesting and complex one—for many types of programming, especially kernel programming. Linux has seen the introduction and optimization of several approaches to mutual exclusion over the years. Thus, this section merely summarizes the locking mechanisms seen in networking code; I refer you to the high-quality, detailed discussions available in O'Reilly's Understanding the Linux Kernel and Linux Device Drivers.

Each mutual exclusion mechanism is the best choice for particular circumstances. Here is a brief summary of the alternative mutual exclusion approaches you will see often in the networking code:

Spin locks

This is a lock that can be held by only one thread of execution at a time. An attempt to acquire the lock by another thread of execution makes the latter loop until the lock is released. Because of the waste caused by looping, spin locks are used only on multiprocessor systems, and generally are used only when the developer expects the lock to be held for short intervals. Also because of the waste caused to other threads, a thread of execution must not sleep while holding a spin lock.

Read-write spin locks

When the uses of a given lock can be clearly classified as read-only and read-write, the use of read-write spin locks is preferred. The difference between spin locks and read-write spin locks is that in the latter, multiple readers can hold the lock at the same time. However, only one writer at a time can hold the lock, and no reader can acquire it when it is already held by a writer. Because readers are given priority over writers, this type of lock performs well when the number of readers (or the number of read-only lock acquisitions) is a good deal bigger than the number of writers (or the number of read-write lock acquisitions).

When the lock is acquired in read-only mode, it cannot be promoted to read-write mode directly: the lock must be released and reacquired in read-write mode.

Read-Copy-Update (RCU)

RCU is one of the latest mechanisms made available in Linux to provide mutual exclusion. It performs quite well under the following specific conditions:

  • Read-write lock requests are rare compared to read-only lock requests.

  • The code that holds the lock is executed atomically and does not sleep.

  • The data structures protected by the lock are accessed via pointers.

The first condition concerns performance, and the other two are at the base of the RCU working principle.

Note that the first condition would suggest the use of read-write spin locks as an alternative to RCU. To understand why RCU, when its use is appropriate, performs better than read-write spin locks, you need to consider other aspects, such as the effect of the processor caches on SMP systems.

The working principle behind the design of RCU is simple yet powerful. For a clear description of the advantages of RCU and a brief description of its implementation, refer to an article published by its author, Paul McKenney, in the Linux Journal (http://linuxjournal.com/article/6993).[*] You can also refer to Understanding the Linux Kernel and Linux Device Drivers.

An example where RCU is used in the networking code is the routing subsystem. Lookups are more frequent than updates on the cache, and the routine that implements the routing cache lookup does not block in the middle of the search. See Chapter 33.

Semaphores are offered by the kernel but are rarely used in the networking code covered in this book. One example, however, is the code used to serialize configuration changes, which we will see in action in Chapter 8.

Conversions Between Host and Network Order

Data structures spanning more than one byte can be stored in memory with two different formats: Little Endian and Big Endian. The first format stores the least significant byte at the lowest memory address, and the second does the opposite. The format used by an operating system such as Linux depends on the processor in use. For example, Intel processors follow the Little Endian model, and Motorola processors use the Big Endian model.

Suppose our Linux box receives an IP packet from a remote host. How should it read the header, given that it does not know which format, Little Endian or Big Endian, was used by the remote host to initialize the protocol headers? For this reason, each protocol family must define what "endianness" it uses. The TCP/IP stack, for example, follows the Big Endian model.

But this still leaves the kernel developer with a problem: she must write code that can run on many different processors that support different endianness. Some processors might match the endianness of the incoming packet, but those that do not must convert the fields to the endianness used by the processor.

Therefore, every time the kernel needs to read, save, or compare a field of the IP header that spans more than one byte, it must first convert it from network byte order to host byte order or vice versa. The same applies to the other protocols of the TCP/IP stack. When both the protocol and the local host are Big Endian, the conversion routines are simply no-ops because there is no need for any conversion. They always appear in the code to make the code portable; only the conversion routines themselves are platform dependent. Table 1-2 lists the main macros used for the conversion of two-byte and four-byte fields.

Table 1-2. Byte-ordering conversion routines

Macro    Meaning (short is 2 bytes, long is 4 bytes)
htons    Host-to-network byte order (short)
htonl    Host-to-network byte order (long)
ntohs    Network-to-host byte order (short)
ntohl    Network-to-host byte order (long)

The macros are defined in the generic header file include/linux/byteorder/generic.h. This is how each architecture tailors the definition of those macros based on its endianness:

  • For each architecture there is a byteorder.h file in the per-architecture directory include/asm-XXX/.

  • That file includes either include/linux/byteorder/big_endian.h or include/linux/byteorder/little_endian.h, depending on the processor's endianness.

  • Both little_endian.h and big_endian.h include the generic file include/linux/byteorder/generic.h. The definitions of the macros in Table 1-2 are based on other macros that are defined differently by little_endian.h and big_endian.h, and this is how the endianness of the architecture influences the definition of the macros of Table 1-2.

For each macro xxx in Table 1-2 there is a sister macro, __constant_xxx, that is used when the input field is a constant value, such as an element of an enumeration list (see the section "ARP Protocol Initialization" in Chapter 28 for an example). Note that the macros in Table 1-2 are commonly used in the kernel code even when their input is a constant value (see the section "Setting the Ethernet Protocol and Length" in Chapter 13 for an example).

We said earlier in the section that endianness is important when a data field spans more than one byte. Endianness is actually important also when a field of one or more bytes is defined as a collection of bitfields. See, for example, what the IPv4 header looks like in Figure 18-2 in Chapter 18, and how the kernel defines the iphdr structure in include/linux/ip.h. The kernel defines __LITTLE_ENDIAN_BITFIELD and __BIG_ENDIAN_BITFIELD, respectively, in the little_endian.h and big_endian.h files mentioned earlier.

Catching Bugs

A few functions are supposed to be called under specific conditions, or are not supposed to be called under certain conditions. The kernel uses the BUG_ON and BUG_TRAP macros to catch cases where such conditions are not met. When the input condition to BUG_TRAP is false, the kernel prints a warning message. BUG_ON instead prints an error message and panics.

Statistics

It is a good habit for a feature to collect statistics about the occurrence of specific conditions, such as cache lookup successes and failures, memory allocation successes and failures, etc. For each networking feature that collects statistics, this book lists and describes each counter.

Measuring Time

The kernel often needs to measure how much time has passed since a given moment. For example, a routine that carries on a CPU-intensive task often releases the CPU after a given amount of time. It will continue its job when it is rescheduled for execution. This is especially important in kernel code, even though the kernel supports kernel preemption. A common example in the networking code is given by the routines that implement garbage collection. We will see plenty in this book.

The passing of time in kernel space is measured in ticks. A tick is the time between two consecutive expirations of the timer interrupt. The timer takes care of different tasks (we are not interested in them here) and regularly expires HZ times per second. HZ is a variable initialized by architecture-dependent code. For example, it is initialized to 1,000 on i386 machines. This means that the timer interrupt expires 1,000 times per second when Linux runs on an i386 system, and that there is one millisecond between two consecutive expirations.

Every time the timer expires it increments the global variable called jiffies. This means that at any time, jiffies represents the number of ticks since the system booted, and the generic value n*HZ represents n seconds of time.

If all a function needs is to measure the passing of time, it can save the value of jiffies into a local variable and later compare the difference between jiffies and that timestamp against a time interval (expressed in number of ticks) to see how much time has passed since measurement started.

The following example shows a function that needs to do some kind of work but does not want to hold the CPU for more than one tick. When do_something says the work is completed by setting job_done to a nonzero value, the function can return:

unsigned long start_time = jiffies;
int job_done = 0;
do {
    do_something(&job_done);
    if (job_done)
        return;
} while (jiffies - start_time < 1);

For a couple of examples involving real kernel code using jiffies, see the section "Backlog Processing: The process_backlog Poll Virtual Function" in Chapter 10, or the section "Asynchronous cleanup: the neigh_periodic_timer function" in Chapter 27.

User-Space Tools

Different tools can be used to configure the many networking features available on Linux. As mentioned at the beginning of the chapter, you can make thoughtful use of these tools to manipulate the kernel for learning purposes and to discover the effects of these changes.

The following tools are the ones I will refer to often in this book:

iputils

Besides the perennial command ping, iputils includes arping (used to generate ARP requests), the Network Router Discovery daemon rdisc, and others.

net-tools

This is a suite of networking tools, where you can find the well-known ifconfig, route, netstat, and arp, but also ipmaddr, iptunnel, ether-wake, netplugd, etc.

IPROUTE2

This is the new-generation networking configuration suite (although it has been around for a few years already). Through an omnibus command named ip, the suite can be used to configure IP addresses and routing along with all of its advanced features, neighboring protocols, etc.

IPROUTE2's source code can be downloaded from http://linux-net.osdl.org/index.php/Iproute2, and the other packages can be downloaded from the download server of most Linux distributions.

These packages are included by default on most (if not all) Linux distributions. Whenever you do not understand how the kernel code processes a command from user space, I encourage you to look at the user-space tool source code and see how the command from the user is packaged and sent to the kernel.

At the following URLs, you can find good documentation on how to use the aforementioned tools, including active mailing lists:[*]

If you want to follow the latest changes in the networking code, keep an eye on the following mailing list:

Other, more specific URLs will be given in the associated chapters.

Browsing the Source Code

The Linux kernel has gotten pretty big, and browsing the code with our old friend grep is definitely not a good idea anymore. Nowadays you can count on different pieces of software to make your journey into the kernel code a better experience.

One that I would like to suggest to those that do not know it already is cscope, which you can download from http://cscope.sourceforge.net. It is a simple yet powerful tool for searching, for example, where a function or variable is defined, where it is called, etc. Installing the tool is straightforward and you can find all the necessary instructions on the web site.

Each of us has his preferred editor, and probably the majority of us are fans of some form of either Emacs or vi. Both editors can use a special file, called a "tags" file, to allow the user to move through source code. (cscope also uses a similar database file.) You can easily create such files with dedicated targets in the makefile in the kernel root tree. The three databases (TAGS, tags, and cscope.out) are created, respectively, with make TAGS, make tags, and make cscope.[*]

Be aware that those files are pretty big, especially the one used by cscope. Therefore, make sure before building the file that you have a lot of free disk space.

If you are already using other source navigation tools, fine. But if you are not using any and have been lazy so far, it is time to say goodbye to grep and invest 15 minutes in learning how to use the aforementioned tools—they are well worth it.

Dead Code

The kernel, like any other large and dynamic piece of software, includes pieces of code that are no longer invoked. Unfortunately, you rarely see comments in the code that tell you this. You may sometimes find yourself having trouble trying to understand how a given function is used or a given variable is initialized simply because you are looking at dead code. If you are lucky, that code does not compile and you can guess its out-of-date status. Other times you may not be that lucky.

Each kernel subsystem is supposed to be assigned one or more maintainers. However, some maintainers simply have too much code to look at, and insufficient free time to do it. Other times they may have lost interest in maintaining their subsystems but could not find any substitutes for their role. It is therefore good to keep this in mind when looking at code that seems to do something strange or that simply does not adhere to general, common-sense programming rules.

In this book, I tried, whenever meaningful, to alert you about functions, variables, and data structure fields that are not used, perhaps because they were left behind when removing a feature or because they were introduced for a new feature whose coding was never completed.

When a Feature Is Offered as a Patch

The kernel networking code is continuously evolving. Not only does it integrate new features, but existing components sometimes undergo design changes to achieve more modularity and higher performance. This obviously makes Linux very attractive as an embedded operating system for network appliance products (routers, switches, firewalls, load balancers, etc.).

Because anyone can develop a new feature for the Linux kernel, or extend or reimplement an existing one, the greatest thrill for any "open" developer is to see her work make it to the official kernel release. Sometimes, however, that is not possible or it may take a long time, even when a project has valuable features and is well implemented. Common reasons include:

  • The code may not have been written following the guidelines in Documentation/CodingStyle.

  • Another major project that provides the same functionality has been around for some time and has already received the green light from the Linux community and from the key kernel developers that maintain the associated kernel area.

  • There is too much overlap with another kernel component. In a case like this, the best approach is to remove the redundant functionality and use existing functionality where possible, or to extend the latter so that it can be used in new contexts. This situation underlines the importance of modularity.

  • The size of the project and the amount of work required to maintain it in a quick-changing kernel may lead the new project's developers to keep it as a separate patch and release a new version only once in a while.

  • The feature would be used only in very specific scenarios, considered not necessary in a general-purpose operating system. In this case, a separate patch is often the best solution.

  • The overall design may not satisfy some key kernel developers. These experts usually have the big picture in mind, concerning both where the kernel is and where it is going. Often, they request design changes to make a feature fit into the kernel the right way.

Sometimes, overlap between features is hard to remove completely, perhaps, for example, because a feature is so flexible that its different uses become apparent only after some time. For example, the firewall has hooks in several places in the network stack. This makes it unnecessary for other features to implement any filtering or marking of data packets going in any direction: they can simply rely on the firewall. Of course, this creates dependencies (for example, if the routing subsystem wants to mark traffic matching specific criteria, the kernel must include support for the firewall). Also, the firewall maintainers must be ready to accept reasonable enhancement requests when they are deemed to be required by other kernel features. However, the compromise is often worth the gain: less redundant code means fewer bugs, easier code maintenance, simplified code paths, and other benefits.

An example of a recent cleanup of feature overlap is the removal of stateless Network Address Translation (NAT) support by the routing code in version 2.6 of the kernel. The developers realized that the stateful NAT support in the firewall is more flexible, and therefore that it was no longer worthwhile maintaining stateless NAT code (although it is faster and consumes less memory). Note that a new module could be written for Netfilter at any time to provide stateless NAT support if necessary.




[*] See Chapter 7 for a description of this macro.

[*] For more documentation, you can refer to the following URL maintained by the author: http://www.rdrop.com/users/paulmck/rclock.

[*] I do not cover the firewall infrastructure design in this book, but I often show where the firewall hooks are located when analyzing various network protocols and layers.

[*] The tags and TAGS files are created with the help of the ctags utility.

Chapter 2. Critical Data Structures

A few key data structures are referenced throughout the Linux networking code. Both when reading this book and when studying the source code directly, you'll need to understand the fields in these data structures. To be sure, going over data structures field by field is less fun than unraveling functions, but it's an important foundation to have. "Show me your data," said the legendary software engineer, Frederick P. Brooks.

This chapter introduces the following data structures, and mentions some of the functions and macros that manipulate them:

struct sk_buff

This is where a packet is stored. The structure is used by all the network layers to store their headers, information about the user data (the payload), and other information needed internally for coordinating their work.

struct net_device

Each network device is represented in the Linux kernel by this data structure, which contains information about both its hardware and its software configuration. See Chapter 8 for details on when and how net_device data structures are allocated.

Another critical data structure for Linux networking is struct sock, which stores the networking information for sockets. Because this book does not cover sockets, I have not included sock in this chapter.

The Socket Buffer: sk_buff Structure

This is probably the most important data structure in the Linux networking code, representing the headers for data that has been received or is about to be transmitted. Defined in the <include/linux/skbuff.h> include file, it consists of a tremendous heap of variables that try to be all things to all people.

The structure has changed many times in the history of the kernel, both to add new options and to reorganize existing fields into a cleaner layout. Its fields can be classified roughly into the following categories:

  • Layout

  • General

  • Feature-specific

  • Management functions

This structure is used by several different network layers (MAC or another link protocol on the L2 layer, IP on L3, TCP or UDP on L4), and various fields of the structure change as it is passed from one layer to another. L4 appends a header before passing it to L3, which in turn puts on its own header before passing it to L2. Appending headers is more efficient than copying the data from one layer to another. Since adding space to the beginning of a buffer—which means changing the variable that points to it—is a complicated operation, the kernel provides the skb_reserve function (described later in this chapter) to carry it out. Thus, one of the first things done by each protocol, as the buffer passes down through layers, is to call skb_reserve to reserve space for the protocol's header.[1] In the later section "Data reservation and alignment: skb_reserve, skb_put, skb_push, and skb_pull," we will see an example of how the kernel makes sure enough space is reserved at the head of the buffer to allow each layer to add its own header while the buffer traverses the layers.

When the buffer passes up through the network layers, each header from the old layer is no longer of interest. The L2 header, for instance, is used only by the device drivers that handle the L2 protocol, so it is of no interest to L3. Instead of removing the L2 header from the buffer, the pointer to the beginning of the payload is moved ahead to the beginning of the L3 header, which requires fewer CPU cycles.

The rest of this section explains a basic principle about conditional (optional) fields, and then covers each of the categories just listed.

Networking Options and Kernel Structures

As you can see from glancing at TCP/IP specifications or configuring a kernel, network code provides an enormous number of options that are useful but not always required, such as a Firewall, Multicasting, and other features. Most of these options require additional fields in kernel data structures. Therefore, sk_buff is peppered with C preprocessor #ifdef directives. For example, near the bottom of the sk_buff definition you can find:

struct sk_buff {
    ... ... ...
#ifdef CONFIG_NET_SCHED
    __u32    tc_index;
#ifdef CONFIG_NET_CLS_ACT
    __u32    tc_verd;
    __u32    tc_classid;
#endif
#endif
};

This shows that the field tc_index is part of the data structure only if the CONFIG_NET_SCHED symbol is defined at compile time, which means that the right option (in this example, "Device Drivers → Networking support → Networking options → QoS and/or fair queueing") has been enabled with some version of make config by an administrator or by an automated installation utility.

The previous example actually shows two nested options: the fields used by CONFIG_NET_CLS_ACT (packet classifier) are considered for inclusion only if support for "QoS and/or fair queueing" is present.

Notice, by the way, that the QoS option cannot be compiled as a module. The reason is that most of the consequences of enabling the option will not be reversible after the kernel is compiled. In general, any option that causes a change in a kernel data structure (such as adding the tc_index field to the sk_buff structure) renders the option unfit to be compiled as a module.

You'll often want to find out which compile option from make config or its variants is associated with a given #ifdef symbol, to understand when a block of code is included in the kernel. The fastest way to make the association, in the 2.6 kernels, is to look for the symbol in the kconfig files that are spread all over the source tree (one per directory). In 2.4 kernels, you can consult the file Documentation/Configure.help.

Layout Fields

A few of the sk_buff's fields exist just to facilitate searching and to organize the data structure itself. The kernel maintains all sk_buff structures in a doubly linked list. But the organization of this list is somewhat more complicated than that of a traditional doubly linked list.

Like any doubly linked list, this one is tied together by next and prev fields in each sk_buff structure, the next field pointing forward and the prev field pointing backward. But this list has another requirement: each sk_buff structure must be able to find the head of the whole list quickly. To implement this requirement, an extra structure of type sk_buff_head is inserted at the beginning of the list, as a kind of dummy element. The sk_buff_head structure is:

struct sk_buff_head {
    /* These two members must be first. */
    struct sk_buff    * next;
    struct sk_buff    * prev;

    __u32        qlen;
    spinlock_t    lock;
};

qlen represents the number of elements in the list. lock is used to prevent simultaneous accesses to the list and is described in the section "List management functions," later in this chapter.

The first two elements of both sk_buff and sk_buff_head are the same: the next and prev pointers. This allows the two structures to coexist in the same list, even though sk_buff_head is positively skimpy in comparison to sk_buff. In addition, the same functions can be used to manipulate both sk_buff and sk_buff_head.

To add to the complexity, every sk_buff structure contains a pointer to the single sk_buff_head structure. This pointer has the field name list. See Figure 2-1 for help finding your way around these data structures.

Figure 2-1. List of sk_buff elements

Other interesting fields of sk_buff follow:

struct sock *sk

This is a pointer to a sock data structure of the socket that owns this buffer. This pointer is needed when data is either locally generated or being received by a local process, because the data and socket-related information is used by L4 (TCP or UDP) and by the user application. When a buffer is merely being forwarded (that is, neither the source nor the destination is on the local machine), this pointer is NULL.

unsigned int len

This is the size of the block of data in the buffer. This length includes both the data in the main buffer (i.e., the one pointed to by head) and the data in the fragments.[1] Its value changes as the buffer moves from one network layer to the next, because headers are discarded while moving up in the stack and are added while moving down the stack. len accounts for protocol headers as well, as shown in Figure 2-8 in the section "Data reservation and alignment: skb_reserve, skb_put, skb_push, and skb_pull."

unsigned int data_len

Unlike len, data_len accounts only for the size of the data in the fragments.

unsigned int mac_len

This is the size of the MAC header.

atomic_t users

This is the reference count, or the number of entities using this sk_buff buffer. The main use of this parameter is to avoid freeing the sk_buff structure when someone is still using it. For this reason, each user of the buffer should increment and decrement this field when necessary. This counter covers only the users of the sk_buff data structure; the buffer containing the actual data is covered by a similar field (dataref) that will be introduced later in the chapter, in the section "The skb_shared_info structure and the skb_shinfo function."

users is sometimes incremented and decremented directly with the atomic_inc and atomic_dec functions, but most of the time it is manipulated with skb_get and kfree_skb.

unsigned int truesize

This field represents the total size of the buffer, including the sk_buff structure itself. It is initially set by the function alloc_skb to len+sizeof(sk_buff) when the buffer is allocated for a requested data space of len bytes.

struct sk_buff *alloc_skb(unsigned int size, int gfp_mask)
{
     ... ... ...
     skb->truesize = size + sizeof(struct sk_buff);
     ... ... ...
}

The field gets updated whenever skb->len is increased.

unsigned char *head

unsigned char *end

unsigned char *data

unsigned char *tail

These represent the boundaries of the buffer and the data within it. When each layer prepares the buffer for its activities, it may allocate more space for a header or for more data. head and end point to the beginning and end of the space allocated to the buffer, and data and tail point to the beginning and end of the actual data. See Figure 2-2. The layer can then fill in the gap between head and data with a protocol header, or the gap between tail and end with new data. You will see in the later section "Allocating memory: alloc_skb and dev_alloc_skb" that the buffer on the right side of Figure 2-2 includes an additional header at the bottom.

Figure 2-2. head/end versus data/tail pointers

void (*destructor)(...)

This function pointer can be initialized to a routine that performs some activity when the buffer is removed. When the buffer does not belong to a socket, the destructor is usually not initialized. When the buffer belongs to a socket, it is usually set to sock_rfree or sock_wfree (by the skb_set_owner_r and skb_set_owner_w initialization functions, respectively). The two sock_ xxx routines are used to update the amount of memory held by the socket in its queues.

General Fields

This section covers the majority of sk_buff fields, which are not associated with specific kernel features:

struct timeval stamp

This is usually meaningful only for a received packet. It is a timestamp that represents when a packet was received or (occasionally) when one is scheduled for transmission. It is set by the function netif_rx with net_timestamp, which is called by the device driver after the reception of each packet and is described in Chapter 21.

struct net_device *dev

This field, whose type (net_device) will be described in more detail later in the chapter, describes a network device. The role of the device represented by dev depends on whether the packet stored in the buffer is about to be transmitted or has just been received.

When a packet is received, the device driver updates this field with the pointer to the data structure representing the receiving interface, as illustrated by the following piece of code from vortex_rx, the function called by the driver of the 3c59x Ethernet card series when receiving a frame (in drivers/net/3c59x.c):

static int vortex_rx(struct net_device *dev)
{
           ... ... ...
        skb->dev = dev;
           ... ... ...
        skb->protocol = eth_type_trans(skb, dev);
        netif_rx(skb); /* Pass the packet to the higher layer */
           ... ... ...
}

When a packet is to be transmitted, this parameter represents the device through which it will be sent out. The code that sets the value is more complicated than the code for receiving a packet, so I will postpone a discussion until Chapter 21 and Chapter 35.

Some network features allow a few devices to be grouped together to represent a single virtual interface (that is, one that is not directly associated with a hardware device), served by a virtual device driver. When the device driver is invoked, the dev parameter points to the virtual device's net_device data structure. The driver chooses a specific device from its group and changes the dev parameter to point to the net_device data structure of that device. Under these circumstances, therefore, the pointer to the transmitting device may be changed during packet processing.

struct net_device *input_dev

This is the device the packet has been received from. It is a NULL pointer when the packet has been generated locally. For Ethernet devices, it is initialized in eth_type_trans (see Chapters 10 and 13). It is used mainly by Traffic Control.

struct net_device *real_dev

This field is meaningful only for virtual devices, and represents the real device the virtual one is associated with. The Bonding and VLAN interfaces use it, for example, to remember where the real device ingress traffic is received from.

union {...} h

union {...} nh

union {...} mac

These are pointers to the protocol headers of the TCP/IP stack: h for L4, nh for L3, and mac for L2. Each field points to a union of various structures, one structure for each protocol understood by the kernel at that layer. For instance, h is a union that includes a field for the header of each L4 protocol understood by the kernel. One member of each union is called raw and is used for initialization; all later accesses are through the protocol-specific members.

When receiving a data packet, the function responsible for processing the layer n header receives a buffer from layer n−1 with skb->data pointing to the beginning of the layer n header. The function that handles layer n initializes the proper pointer for this layer (for instance, skb->nh for L3 handlers) to preserve the skb->data field, because the contents of this pointer will be lost during the processing at the next layer, when skb->data is initialized to a different offset within the buffer. The function then completes the layer n processing and, before passing the packet to the layer n+1 handler, updates skb->data to make it point to the end of the layer n header, which is the beginning of the layer n+1 header (see Figure 2-3).

Sending a packet reverses this process, with the added complexity of adding a new header at each layer.

Figure 2-3. Header's pointer initializations while moving from layer two to layer three

struct dst_entry *dst

This is used by the routing subsystem. Because the data structure is quite complex and requires knowledge of how other subsystems work, I'll postpone a description of it until Part VII.

char cb[40]

This is a "control buffer," or storage for private information, maintained by each layer for internal use. It is statically allocated within the sk_buff structure (currently with a size of 40 bytes) and is large enough to hold whatever private data is needed by each layer. In the code for each layer, access is done through macros to make the code more readable. TCP, for example, uses that space to store a tcp_skb_cb data structure, which is defined in include/net/tcp.h:

struct tcp_skb_cb {
    ... ... ...
    __u32        seq;        /* Starting sequence number */
    __u32        end_seq;    /* SEQ + FIN + SYN + datalen */
    __u32        when;       /* used to compute rtt's    */
    __u8         flags;      /* TCP header flags.        */
    ... ... ...
};

And this is the macro used by the TCP code to access the structure. The macro consists simply of a pointer cast:

#define TCP_SKB_CB(__skb)    ((struct tcp_skb_cb *)&((__skb)->cb[0]))

Here is an example where the TCP subsystem fills in the structure upon receipt of a segment:

int tcp_v4_rcv(struct sk_buff *skb)
{
        ... ... ...
        th = skb->h.th;
        TCP_SKB_CB(skb)->seq = ntohl(th->seq);
        TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
                                    skb->len - th->doff * 4);
        TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
        TCP_SKB_CB(skb)->when = 0;
        TCP_SKB_CB(skb)->flags = skb->nh.iph->tos;
        TCP_SKB_CB(skb)->sacked = 0;
        ... ... ...
}

To see how the parameters in the cb buffer are retrieved, take a look at the function tcp_transmit_skb in net/ipv4/tcp_output.c. That function is used by TCP to push a data segment down to the IP layer for transmission.

In Chapter 22, you will also see how IPv4 uses cb to store information about IP fragmentation.

unsigned int csum

unsigned char ip_summed

These represent the checksum and associated status flag. Their use is described in Chapter 19.

unsigned char cloned

A boolean flag that, when set, indicates that this structure is a clone of another sk_buff buffer. See the later section "Cloning and copying buffers."

unsigned char pkt_type

This field classifies the type of frame based on its L2 destination address. The possible values are listed in include/linux/if_packet.h. For Ethernet devices, this parameter is initialized by the function eth_type_trans, which is described in Chapter 13.

The main values it can be assigned are:

PACKET_HOST

The destination address of the received frame is that of the receiving interface; in other words, the packet has reached its destination.

PACKET_MULTICAST

The destination address of the received frame is one of the multicast addresses to which the interface is registered.

PACKET_BROADCAST

The destination address of the received frame is the broadcast address of the receiving interface.

PACKET_OTHERHOST

The destination address of the received frame does not belong to the ones associated with the interface (unicast, multicast, and broadcast); thus, the frame will have to be forwarded if forwarding is enabled, and dropped otherwise.

PACKET_OUTGOING

The packet is being sent out; among the users of this flag are the Decnet protocol and the function that gives each network tap a copy of the outgoing packet (see dev_queue_xmit_nit in Chapter 11).

PACKET_LOOPBACK

The packet is being sent out to the loopback device. Thanks to this flag, when dealing with the loopback device, the kernel can skip some operations needed for real devices.

PACKET_FASTROUTE

The packet is being routed using the Fastroute feature. Fastroute support is not available anymore in 2.6 kernels.

Chapter 13 details how those values are set based on the L2 destination address value.

__u32 priority

This indicates the Quality of Service (QoS) class of a packet being transmitted or forwarded. If the packet is generated locally, the socket layer defines the priority value. If instead the packet is being forwarded, the function rt_tos2priority (called from the ip_forward function) defines the value of the field according to the value of the Type of Service (ToS) field in the IP header itself. The value of this parameter has nothing to do with the DiffServ Code Point (DSCP) described in Chapter 18. I will discuss its role in the section "ip_forward Function" in Chapter 20.

unsigned short protocol

This is the protocol used at the next-higher layer from the perspective of the device driver at L2. Typical protocols listed here are IP, IPv6, and ARP; a complete list is available in include/linux/if_ether.h. Since each protocol has its own function handler for the processing of incoming packets, this field is used by the driver to inform the layer above it what handler to use. Each driver calls netif_rx to invoke the handler for the upper network layer, so the protocol field must be initialized before that function is invoked. See Chapters 10 and 13 for more detail.

unsigned short security

This is the security level of the packet. This field was originally introduced for use with IPsec but is no longer used.

Feature-Specific Fields

The Linux kernel is modular, allowing you to select what to include and what to leave out. Thus, some fields are included in the sk_buff data structure only if the kernel is compiled with support for particular features such as firewalling (Netfilter) or QoS:

unsigned long nfmark

__u32 nfcache

__u32 nfctinfo

struct nf_conntrack *nfct

unsigned int nfdebug

struct nf_bridge_info *nf_bridge

These parameters are used by Netfilter (the firewall code), and more specifically by the kernel option "Device Drivers → Networking support → Networking options → Network packet filtering" and its two suboptions, "Network packet filtering debugging" and "Bridged IP/ARP packets filtering."

union {...} private

This union is used by the High Performance Parallel Interface (HIPPI). The associated kernel option is "Device Drivers → Networking support → Network device support → HIPPI driver support."

__u32 tc_index

__u32 tc_verd

__u32 tc_classid

These parameters are used by the Traffic Control, and more specifically by the kernel option "Device Drivers → Networking support → Networking options → QoS and/or fair queueing" and its suboption, "Packet classifier API."

struct sec_path *sp

This is used by the IPsec protocol suite to keep track of transformations.

Management Functions

Lots of functions, usually very short and simple, are offered by the kernel to manipulate sk_buff elements or lists of elements. With the help of Figure 2-4, I'll describe the most important ones. First we will see the functions used to allocate and free buffers, and then the ones used to manipulate the pointers (i.e., skb->data) to reserve space at the head or at the tail of a frame.

If you take a look at the files include/linux/skbuff.h and net/core/skbuff.c, you will notice that almost all of the functions exist in two versions, with names like do_something and __do_something. Usually, the first one is a wrapper that adds extra sanity checks or locking mechanisms around a call to the second one. The internal __do_something form is generally not called directly (unless specific conditions, such as locking requirements, are met). Exceptions to that rule are usually poorly coded functions that will be fixed eventually.

Figure 2-4. Before and after: (a)skb_put, (b)skb_push, (c)skb_pull, and (d)skb_reserve

Allocating memory: alloc_skb and dev_alloc_skb

alloc_skb is the main function for the allocation of buffers and is defined in net/core/skbuff.c. We have already seen that the data buffer and the header (the sk_buff data structure) are two different entities, which means that creating a single buffer involves two allocations of memory (one for the buffer and one for the sk_buff structure).

alloc_skb takes an sk_buff data structure from a cache by calling the function kmem_cache_alloc, and gets a data buffer by calling kmalloc, which also uses cached memory if it is available. The code (slightly simplified) is:

    skb = kmem_cache_alloc(skbuff_head_cache, gfp_mask & ~__GFP_DMA);
    ... ... ...
    size = SKB_DATA_ALIGN(size);
    data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);

Before calling kmalloc, the size parameter is tuned with the macro SKB_DATA_ALIGN to force alignment. Before returning, the function initializes a few parameters in the structure, producing the final result shown in Figure 2-5.

At the bottom of the memory block on the right side of Figure 2-5 you can see the padding area introduced to force the alignment. The skb_shared_info block is mainly used to handle IP fragments and is described later in this chapter. The fields shown on the left side of the figure were explained earlier.

Figure 2-5. alloc_skb function

dev_alloc_skb is the buffer allocation function meant for use by device drivers and expected to be executed in interrupt mode. It is simply a wrapper around alloc_skb that adds 16 bytes to the requested size for optimization reasons and asks for an atomic operation (GFP_ATOMIC) since it will be called from within an interrupt handler routine:

static inline struct sk_buff *dev_alloc_skb(unsigned int length)
{
    return __dev_alloc_skb(length, GFP_ATOMIC);
}


static inline
struct sk_buff *__dev_alloc_skb(unsigned int length, int gfp_mask)
{
    struct sk_buff *skb = alloc_skb(length + 16, gfp_mask);
    if (likely(skb))
            skb_reserve(skb, 16);
    return skb;
}

This definition of __dev_alloc_skb is the default one used when there is no architecture-specific definition.

Freeing memory: kfree_skb and dev_kfree_skb

These two functions release a buffer, which results in its return to the buffer pool (cache). kfree_skb is both called directly and invoked through the dev_kfree_skb wrapper. The latter is defined for use by device drivers, to have a name that parallels dev_alloc_skb but consists of a simple macro that does nothing but call kfree_skb. This basic function releases a buffer only when the skb->users counter is 1 (when no users of the buffer are left). Otherwise, the function simply decrements that counter. So if a buffer had three users, only the third call to dev_kfree_skb or kfree_skb would free memory.

The flowchart in Figure 2-6 shows all the steps involved in freeing a buffer. As you will see in Chapter 33, an sk_buff structure can hold a reference on a dst_entry data structure. When the sk_buff structure is freed, therefore, dst_release also has to be called to decrement the reference count on the associated dst_entry data structure.

When the destructor function pointer has been initialized, it is called here (see the section "Layout Fields" earlier in this chapter).

We have seen in Figure 2-5 what a simple scenario looks like: an sk_buff data structure is associated to another memory block where the actual data is stored. However, the skb_shared_info data structure at the bottom of that data block, as shown in Figure 2-5, can hold pointers to other memory fragments. See Chapter 21 for some examples. kfree_skb releases the memory held by those fragments as well, when they are present. Finally, the sk_buff data structure is returned to the skbuff_head_cache cache.

Data reservation and alignment: skb_reserve, skb_put, skb_push, and skb_pull

skb_reserve reserves some space (headroom) at the head of the buffer and is commonly used to allow the insertion of a header or to force data to be aligned on some boundary. The function shifts the data and tail pointers (discussed earlier in the section "Layout Fields") that mark the beginning and the end of the payload, respectively. Figure 2-4(d) shows the result of calling skb_reserve(skb,n). This function is usually called soon after the allocation of the buffer, when data and tail are still the same.

If you look at the receive function of one of the Ethernet drivers (for instance, vortex_rx in drivers/net/3c59x.c) you will see that they all use the following command before storing any data in the buffer they have just allocated:

skb_reserve(skb, 2);    /* Align IP on 16 byte boundaries */
Figure 2-6. kfree_skb function

Because they know that they are about to copy an Ethernet frame that has a header 14 octets long into the buffer, the argument of 2 shifts the head of the buffer 2 bytes. This keeps the IP header, which follows immediately after the Ethernet header, aligned on a 16-byte boundary from the beginning of the buffer, as shown in Figure 2-7.

Figure 2-7. (a) before skb_reserve, (b) after skb_reserve, and (c) after copying the frame on the buffer

Figure 2-8 shows an example of using skb_reserve in the opposite direction, during data transmission.

Figure 2-8. Buffer that is filled in while traversing the stack from the TCP layer down to the link layer

  1. When TCP is asked to transmit some data, it allocates a buffer following certain criteria (TCP Maximum Segment Size (mss), support for scatter gather I/O, etc.).

  2. TCP reserves (with skb_reserve) enough space at the head of the buffer to hold all the headers of all layers (TCP, IP, link layer). The parameter MAX_TCP_HEADER is the sum of all headers of all levels and is calculated taking into account the worst-case scenarios: because the TCP layer does not know what type of interface will be used for the transmission, it reserves the biggest possible header for each layer. It even accounts for the possibility of multiple IP headers (because you can have multiple IP headers when the kernel is compiled with support for IP over IP).

  3. The TCP payload is copied into the buffer. Note that Figure 2-8 is just an example. The TCP payload could be organized differently; for example, it could be stored as fragments. In Chapter 21, we will see what a fragmented buffer (also commonly called a paged buffer) looks like.

  4. The TCP layer adds its header.

  5. The TCP layer hands the buffer to the IP layer, which adds its header as well.

  6. The IP layer hands the IP packet to the neighboring layer, which adds the link layer header.

Note that while the buffer travels down the network stack, each protocol moves skb->data down, copies in its header, and updates skb->len. All of this is accomplished with the functions we saw in Figure 2-4.

Note that the skb_reserve function does not really move anything into or within the data buffer; it simply updates the two pointers as depicted in Figure 2-4(d).

static inline void skb_reserve(struct sk_buff *skb, unsigned int len)
{
    skb->data+=len;
    skb->tail+=len;
}

skb_push adds one block of data to the beginning of the buffer, and skb_put adds one to the end. Like skb_reserve, these functions don't really add any data to the buffer; they simply move the pointers that mark its head or tail. The new data is supposed to be copied explicitly by other functions. skb_pull removes a block of data from the head of the buffer by moving the data pointer forward. Figure 2-4 shows how these functions work.

The skb_shared_info structure and the skb_shinfo function

As shown in Figure 2-5, there is a structure called skb_shared_info at the end of the data buffer that keeps additional information about the data block. The data structure immediately follows the end pointer that marks the end of the data. This is the definition of the data structure:

struct skb_shared_info {
    atomic_t        dataref;
    unsigned int    nr_frags;
    unsigned short  tso_size;
    unsigned short  tso_seqs;
    struct sk_buff  *frag_list;
    skb_frag_t      frags[MAX_SKB_FRAGS];
};

dataref represents the number of "users" of the data block and is described in the next section, "Cloning and copying buffers." nr_frags, frag_list, and frags are used to handle IP fragments and are described in Chapter 21. The skb_is_nonlinear routine can be used to check whether the buffer is fragmented, and skb_linearize[1] can be used to collapse the fragments into a single flat buffer. Collapsing the fragments involves copying, which introduces a performance penalty.

Some network interface cards (NICs) can handle in hardware some of the tasks that have traditionally been done by the CPU. The most common example is the computation of the L3 and L4 checksums. Some NICs can even maintain the L4 protocol's state machines. For the sake of the code shown here, we are interested in TCP segmentation offload, where the NIC implements a subset of the TCP layer. tso_size and tso_seqs are used by this feature.

Note that there is no field inside the sk_buff structure pointing at the skb_shared_info data structure. To access that structure, functions need to use the skb_shinfo macro, which simply returns the end pointer:

#define skb_shinfo(SKB)    ((struct skb_shared_info *)((SKB)->end))

The following statement, for instance, shows how the macro is used to increment a field of the private block:

skb_shinfo(skb)->nr_frags++;

Cloning and copying buffers

When the same buffer needs to be processed independently by different consumers, and they may need to change the content of the sk_buff descriptor (the h and nh pointers to the protocol headers), the kernel does not need to make a complete copy of both the sk_buff structure and the associated data buffers. Instead, to be more efficient, the kernel can clone the original, which consists of making a copy of the sk_buff structure only and playing with the reference counts to avoid releasing the shared data block prematurely. Buffer cloning is done with the skb_clone function.

An example of a situation using cloning is when an ingress packet needs to be delivered to multiple recipients, such as the protocol handler and one or more network taps (see Chapter 21).

The sk_buff clone is not linked to any list and has no reference to the socket owner. The field skb->cloned is set to 1 in both the clone and the original buffer. skb->users is set to 1 in the clone so that the first attempt to remove it succeeds, and the number of references (dataref) to the buffer containing the data is incremented (since now there is one more sk_buff data structure pointing to it). Figure 2-9 shows an example of a cloned buffer.

Figure 2-9. skb_clone function

The skb_cloned routine can be used to check the cloned status of an skb buffer.

Figure 2-9 shows an example of a fragmented buffer—that is to say, a buffer that has some data stored in data fragments linked with the frags array. We will see how fragmented buffers are used in Chapter 21; for now, let's not bother with those details.

The skb_share_check routine can be used to check the reference count skb->users and clone the buffer skb when the users field says the buffer is shared.

When a buffer is cloned, the contents of the data block cannot be modified. This means that code can access the data without any need for locking. When, however, a function needs to modify not only the contents of the sk_buff structure but the data too, it needs to clone the data block as well. In this case, the programmer has two options. When he knows he needs to modify only the contents of the data in the area between skb->head and skb->end, he can use pskb_copy to clone just that area. When he thinks he may need to modify the content of the fragment data blocks too, he must use skb_copy. The result of both pskb_copy and skb_copy is shown in Figure 2-10. You will see in Chapter 21 that the skb_shared_info data structure can include a list of sk_buff structures too (linked to a field called frag_list). That list is handled by pskb_copy and skb_copy in the same way as the frags array (this detail has been omitted from Figure 2-10 to keep the latter more readable).

Figure 2-10. (a) pskb_copy function and (b) skb_copy function

You may not be able to appreciate all of the details in Figures 2-9 and 2-10 at this point. Later in the book, especially once you have gone through Part V, everything will make more sense.

While discussing the various topics of this book, I will sometimes emphasize that a given function needs to clone or copy a buffer. When deciding to make a clone or copy of a buffer, programmers of each subsystem cannot anticipate whether other kernel components (or other users of their subsystems) will need the original information in that buffer. The kernel is very modular and changes in a very dynamic and unpredictable way, so each subsystem is ignorant of what other subsystems may do with a buffer. Therefore, the programmers of each subsystem just keep track of any modifications they make to the buffer, and take care to make a copy before modifying anything in case some other part of the kernel needs the original information.

List management functions

These functions manipulate the lists of sk_buff elements, also called queues. For a complete list of functions, see <include/linux/skbuff.h> and <net/core/skbuff.c>. Some of the most commonly used functions are:

skb_queue_head_init

Initializes an sk_buff_head with an empty queue of elements.

skb_queue_head, skb_queue_tail

Adds one buffer to the head or to the tail of a queue, respectively.

skb_dequeue, skb_dequeue_tail

Dequeues an element from the head or from the tail, respectively. The second function should probably have been called skb_dequeue_head to be consistent with the names of the other queueing functions.

skb_queue_purge

Empties a queue.

skb_queue_walk

Runs a loop on each element of a queue in turn.

All functions of this class must be executed atomically—that is, they must grab the spin lock provided by the sk_buff_head structure for the queue. Otherwise, they could be interrupted by asynchronous events that enqueue or dequeue elements from the queues, such as functions invoked by expired timers, which would lead to race conditions.

Thus, each function is implemented as follows:

static inline function_name(parameter_list)
{
        unsigned long flags;

        spin_lock_irqsave(...);
        __function_name(parameter_list);
        spin_unlock_irqrestore(...);
}

The function consists of a wrapper that grabs the lock, does its work by invoking a function whose name begins with two underscores, and releases the lock.

net_device Structure

The net_device data structure stores all information specifically regarding a network device. There is one such structure for each device, both real ones (such as Ethernet NICs) and virtual ones (such as bonding[1] or VLAN). In this section, I will use the words interface and device interchangeably, even though the difference between them is important in other contexts.

The net_device structures for all devices are put into a global list to which the global variable dev_base points. The data structure is defined in include/linux/netdevice.h. The registration of network devices is described in Chapter 8. In that chapter, you can find details on how and when most of the net_device fields are initialized.

Like sk_buff, this structure is quite big and includes many feature-specific parameters, along with parameters from many different layers. For this reason, the overall organization of the structure will probably see some changes soon for optimization reasons.

Network devices can be classified into types such as Ethernet cards and Token Ring cards. While certain fields of the net_device structure are set to the same value for all devices of the same type, some fields must be set differently by each model of device. Thus, for almost every type, Linux provides a general function that initializes the parameters whose values stay the same across all models. Each device driver invokes this function in addition to setting those fields that have unique values for its model. Drivers can also overwrite fields that were already initialized by the kernel (for instance, to improve performance). You can find more details in Chapter 8.

The fields of the net_device structure can be classified into the following categories:

  • Configuration

  • Statistics

  • Device status

  • List management

  • Traffic management

  • Feature specific

  • Generic

  • Function pointers (or VFT)

Identifiers

The net_device structure includes three identifiers, which should not be confused with one another:

int ifindex

A unique ID, assigned to each device when it is registered with a call to dev_new_index.

int iflink

This field is mainly used by (virtual) tunnel devices and identifies the real device that will be used to reach the other end of the tunnel.

unsigned short dev_id

Currently used by IPv6 with the zSeries OSA NICs. The field is used to differentiate between virtual instances of the same device that can be shared between different OSes concurrently. See comments in net/ipv6/addrconf.c.

Configuration

Some of the configuration fields are given a default value by the kernel that depends on the class of network device, and some fields are left to the driver to fill. The driver can change defaults, as mentioned earlier, and some fields can even be changed at runtime by commands such as ifconfig and ip. In fact, several parameters—base_addr, if_port, dma, and irq—are commonly set by the user when the module for the device is loaded. On the other hand, these parameters are not used by virtual devices.

char name[IFNAMSIZ]

Name of the device (e.g., eth0).

unsigned long mem_start

unsigned long mem_end

These fields describe the shared memory used by the device to communicate with the kernel. They are initialized and accessed only within the device driver; higher layers do not need to care about them.

unsigned long base_addr

The beginning of the I/O memory mapped to the device's own memory.

unsigned int irq

The interrupt number used by the device to talk to the kernel. It can be shared among multiple devices. Drivers use the request_irq function to allocate this variable and free_irq to release it.

unsigned char if_port

The type of port being used for this interface. See the next section, "Interface types and ports."

unsigned char dma

The DMA channel used by the device (if any). To obtain and release a DMA channel from the kernel, the file kernel/dma.c defines the functions request_dma and free_dma. To enable or disable a DMA channel after obtaining it, the functions enable_dma and disable_dma are provided in various include/asm- architecture files (e.g., include/asm-i386). The routines are used by ISA devices; Peripheral Component Interconnect (PCI) devices do not need them because they use others instead.

DMA is not available for all devices because some buses don't use it.

unsigned short flags

unsigned short gflags

unsigned short priv_flags

Some bits in the flags field represent capabilities of the network device (such as IFF_MULTICAST) and others represent changing status (such as IFF_UP or IFF_RUNNING). You can find the complete list of these flags in include/linux/if.h. The device driver usually sets the capabilities at initialization time, and the status flags are managed by the kernel in response to external events. The settings of the flags can be viewed through the familiar ifconfig command:

bash# ifconfig lo
lo          Link encap:Local Loopback
            inet addr:127.0.0.1  Mask:255.0.0.0
            UP LOOPBACK RUNNING  MTU:3924  Metric:1
            RX packets:198 errors:0 dropped:0 overruns:0 frame:0
            TX packets:198 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:0

In this example, the words UP LOOPBACK RUNNING correspond to the flags IFF_UP, IFF_LOOPBACK, and IFF_RUNNING.

priv_flags stores flags that are not visible to the user space. Right now this field is used by the VLAN and Bridge virtual devices. gflags is almost never used and is there for compatibility reasons. Flags can be changed through the dev_change_flags function.

int features

Another bitmap of flags used to store some other device capabilities. It is not redundant for this data structure to contain multiple flag variables. The features field reports the card's capabilities for communicating with the CPU, such as whether the card can do DMA to high memory, or checksum all the packets in hardware. The list of the possible features is defined inside the structure net_device itself. This parameter is initialized by the device driver. You can find the list of NETIF_F_XXX features, along with good comments, inside the net_device data structure definition.

unsigned mtu

MTU stands for Maximum Transmission Unit and it represents the maximum size of the frames that the device can handle. Table 2-1 shows the values for the most common network technologies.

Table 2-1. MTU values for different device types

Device type                        MTU
PPP                                296
SLIP                               296
Ethernet                           1,500
ISDN                               1,500
PLIP                               1,500 (ether_setup)
Wavelan                            1,500 (ether_setup)
EtherChannel                       2,024
FDDI                               4,352
Token Ring 4 MB/s (IEEE 802.5)     4,464
Token Bus (IEEE 802.4)             8,182
Token Ring 16 MB/s (IBM)           17,914
Hyperchannel                       65,535

The Ethernet MTU deserves a little clarification. The Ethernet frame specification defines the maximum payload size as 1,500 bytes. Sometimes you find the Ethernet MTU defined as 1,518 or 1,514: the first is the maximum size of an Ethernet frame including the header, and the second includes the header but not the frame check sequence (4 bytes of checksum).

In 1998, Alteon Networks (acquired by Nortel Networks in 2000) promoted an initiative to increase the maximum payload of Ethernet frames to 9 KB. This proposal was later formalized with an IETF Internet draft, but the IEEE never accepted it. Frames exceeding the 1,500 bytes of payload in the IEEE specification are commonly called jumbo frames and are used with Gigabit Ethernet to increase throughput. This is because bigger frames mean fewer frames for large data transfers, fewer interrupts, and therefore less CPU usage, less header overhead, etc. For a discussion of the benefits of increasing the Ethernet MTU and why the IEEE does not agree on standardizing this extension, you can read the white paper "Use of Extended Frame Sizes in Ethernet Networks" that can be found with an Internet search, as well as at http://www.ietf.org/proceedings/01aug/I-D/draft-ietf-isis-ext-eth-01.txt.

unsigned short type

The category of devices to which it belongs (Ethernet, Frame Relay, etc.). include/linux/if_arp.h contains the complete list of possible types.

unsigned short hard_header_len

The size of the device header in octets. The Ethernet header, for instance, is 14 octets long. The length of each device header is defined in the header file for that device. For Ethernet, for instance, ETH_HLEN is defined in <include/linux/if_ether.h>.

unsigned char broadcast[MAX_ADDR_LEN]

The link layer broadcast address.

unsigned char dev_addr[MAX_ADDR_LEN]

unsigned char addr_len

dev_addr is the device link layer address; do not confuse it with the L3 or IP address. The address's length in octets is given by addr_len. The value of addr_len depends on the type of device. Ethernet addresses are 6 octets long.

int promiscuity

See the later section "Promiscuous mode."

Interface types and ports

Some devices come with more than one connector (the most common combination is BNC + RJ45) and allow the user to select one of them depending on her needs. This parameter is used to set the port type for the device. When the device driver is not forced by configuration commands to select a specific port type, it simply chooses a default one. There are also cases where a single device driver can handle different interface models; in those situations, the interface can discover the port type to use by simply trying all of them in a specific order. This piece of code shows how one device driver sets the interface mode depending on how it has been configured:

        switch (dev->if_port) {
        case IF_PORT_10BASE2:
            writeb((readb(addr) & 0xf8) | 1, addr);
            break;
        case IF_PORT_10BASET:
            writeb((readb(addr) & 0xf8), addr);
            break;
        }

Promiscuous mode

Certain network administration tasks require a system to receive all the frames that travel across a shared cable, not just the ones directly addressed to it; a device that receives all packets is said to be in promiscuous mode . This mode is needed, for instance, by applications that check performance or security breaches on their local network segment. Promiscuous mode is also used by bridging code (see Part IV). Finally, it has obvious value to malicious snoopers, unfortunately; for this reason, no data is secure from other users on a local network unless it is encrypted.

The net_device structure contains a counter named promiscuity that indicates a device is in promiscuous mode. The reason it is a counter rather than a simple flag is that several clients may ask for promiscuous mode; therefore, each increments the counter when entering the mode and decrements the counter when leaving the mode. The device does not leave promiscuous mode until the counter reaches zero. Usually the field is manipulated by calling the function dev_set_promiscuity.

Whenever promiscuity is nonzero (such as through a call to dev_set_promiscuity), the IFF_PROMISC bit flag of flags is also set and is checked by the functions that configure the interface.

The following piece of code, taken from the drivers/net/3c59x.c driver, shows how the different receive modes are set based on the flags (bits) in the flags field:

static void set_rx_mode(struct net_device *dev)
{
        int ioaddr = dev->base_addr;
        int new_mode;

        if (dev->flags & IFF_PROMISC) {
             if (vortex_debug > 3)
                        printk("%s: Setting promiscuous mode.\n", dev->name);
             new_mode = SetRxFilter | RxStation | RxMulticast | RxBroadcast | RxProm;
        } else if ((dev->mc_list)  ||  (dev->flags & IFF_ALLMULTI)) {
             new_mode = SetRxFilter | RxStation | RxMulticast | RxBroadcast;
        } else
             new_mode = SetRxFilter | RxStation | RxBroadcast;

        outw(new_mode, ioaddr + EL3_CMD);
}

When the IFF_PROMISC flag is set, the new_mode variable is initialized to accept the traffic addressed to the card (RxStation), multicast traffic (RxMulticast), broadcast traffic (RxBroadcast), and all the other traffic (RxProm). EL3_CMD is the offset to the ioaddr memory address that represents where commands are supposed to be copied when interacting with the device.

Statistics

Instead of providing a collection of fields to keep statistics, the net_device structure includes a pointer named priv that is set by the driver to point to a private data structure storing information about the interface. The private data consists of statistics such as the number of packets transmitted and received and the number of errors encountered.

The format of the structure pointed at by priv depends both on the device type and on the particular model: thus, different Ethernet cards may use different private structures. However, nearly all structures include a field of type net_device_stats (defined in include/linux/netdevice.h) that contains statistics common to all the network devices and that can be retrieved with the method get_stats, described later.

Wireless devices behave so differently from wired devices that wireless ones do not find the net_device_stats data structure appropriate. Instead, they provide a field of type iw_statistics that can be retrieved using a method called get_wireless_stats, described later.

The data structure to which priv points sometimes has a name reflecting the interface (e.g., vortex_private for the Vortex and Boomerang series, also called the 3c59x family), and other times is simply called net_local. Still, the fields in net_local are defined uniquely by each device driver.

The private data structure may be more or less complex depending on the card's capabilities and on how much the device driver writer is willing to employ sophisticated statistics and complex design to enhance performance. Compare, for instance, the generic net_local structure used by the 3c507 Ethernet card in drivers/net/3c507.c with the highly detailed vortex_private structure used by the 3c59x Ethernet card in drivers/net/3c59x.c. Both, however, include a field of type net_device_stats.

As you will see in Chapter 8, the private data structure is sometimes appended to the net_device structure itself (requiring only one malloc for both) and sometimes allocated as a separate block.

Device Status

To control interactions with the NIC, each device driver has to maintain information such as timestamps and flags indicating what kind of behavior the interface requires. In a symmetric multiprocessing (SMP) system, the kernel also has to make sure that concurrent accesses to the same device from different CPUs are handled correctly. Several fields of the net_device structure are dedicated to these types of information:

unsigned long state

A set of flags used by the network queuing subsystem. They are indexed by the constants in the enum netdev_state_t, which is defined in include/linux/netdevice.h and defines constants such as __LINK_STATE_XOFF for each bit. Individual bits are set and cleared using the general functions set_bit and clear_bit, usually invoked through a wrapper that hides the details of the bit used. For example, to stop a device queue, the subsystem invokes netif_stop_queue, which looks like this:

static inline void netif_stop_queue(struct net_device *dev)
{
    ...
    set_bit(__LINK_STATE_XOFF, &dev->state);
}

The Traffic Control subsystem is briefly introduced in Chapter 11.

enum {...} reg_state

The registration state of the device. See Chapter 8.

unsigned long trans_start

The time (measured in jiffies) when the last frame transmission started. The device driver sets it just before starting transmission. The field is used to detect problems with the card if it does not finish transmission after a given amount of time. An overly long transmission means there is something wrong; in that case, the driver usually resets the card.

unsigned long last_rx

The time (measured in jiffies) when the last packet was received. At the moment, it is not used for any specific purpose, but is available in case of need.

struct net_device *master

Some protocols exist that allow a set of devices to be grouped together and be treated as a single device. These protocols include EQL (Equalizer Load-balancer for serial network interfaces), Bonding (also called EtherChannel and trunking), and the TEQL (true equalizer) queuing discipline of Traffic Control. One of the devices in the group is elected to be the so-called master, which plays a special role. This field is a pointer to the net_device data structure of the master device of the group. If the interface is not a member of such a group, the pointer is simply NULL.

spinlock_t xmit_lock

int xmit_lock_owner

The xmit_lock lock is used to serialize accesses to the driver function hard_start_xmit. This means that each CPU can carry out only one transmission at a time on any given device. xmit_lock_owner is the ID of the CPU that holds the lock. It is always 0 on single-processor systems and -1 when the lock is not taken on SMP systems. It is possible to have lockless transmissions, too, when the device driver supports it. See Chapter 11 for both the lock and the lockless cases.

void *atalk_ptr

void *ip_ptr

void *dn_ptr

void *ip6_ptr

void *ec_ptr

void *ax25_ptr

These six fields are pointers to data structures specific to particular protocols, each data structure containing parameters that are used privately by that protocol. ip_ptr, for instance, points to a data structure of type in_device (even though it is declared as void *) that contains different IPv4-related parameters, among them the list of IP addresses configured on the interface (see Chapter 19). Other sections of this book describe the fields of the data structures used by protocols covered in the book. Most of the time only one of these fields is in use.

List Management

net_device data structures are inserted into a global list and into two hash tables, as described in Chapter 8. The following fields are used to accomplish these tasks:

struct net_device *next

Links each net_device data structure to the next in the global list.

struct hlist_node name_hlist

struct hlist_node index_hlist

net_device结构链接到存储桶的两个哈希表列表。

Link the net_device structure into the bucket lists of the two hash tables.

Link Layer Multicast

Multicast is a mechanism used to deliver data to multiple recipients. Multicasting can be available both at the L3 network layer (i.e., IP) and at the L2 link layer (i.e., Ethernet). In this section, we are concerned with the latter.

Link layer multicast delivery can be achieved by using special addresses or control information in the link layer header. (When it is not supported by the link layer protocol, it may be emulated.) Ethernet natively supports multicasting: we will see in Chapter 13 how an Ethernet address can be classified as unicast, multicast, or broadcast.

Multicast addresses are distinguished from the range of other addresses by a specific bit. This means that 50% of the possible addresses are multicast, and 50% of 2^48 is a huge number! When an interface is asked to join a lot of multicast groups (each identified by a multicast address), it may be more efficient and faster for it to simply listen to all the multicast addresses instead of maintaining a long list and wasting time filtering ingress L2 multicast frames based on the list. One of the flags in the net_device data structure indicates whether the device should listen to all addresses. The decision about when to set or clear this flag is controlled by the allmulti field shown in this section.

Each device keeps an instance of the dev_mc_list structure for each link layer multicast address it listens to. Link layer multicast addresses can be added and removed with the functions dev_mc_add and dev_mc_delete, respectively. Relevant fields in the net_device structure include:

struct dev_mc_list *mc_list

Pointer to the head of this device's list of dev_mc_list structures.

int mc_count

The number of multicast addresses for this device, which is also the length of the list to which mc_list points.

int allmulti

When nonzero, causes the device to listen to all multicast addresses. Like promiscuity, discussed earlier in this chapter, allmulti is a reference count rather than a simple Boolean. This is because multiple facilities (VLANs and bonding devices, for instance) may independently require listening to all addresses. When the variable goes from 0 to nonzero, the function dev_set_allmulti is called to instruct the interface to listen to all multicast addresses. The opposite happens when allmulti goes to 0.

Traffic Management

The Traffic Control subsystem of Linux has grown quite a lot and represents one of the strengths of the Linux kernel. The associated kernel option is "Device drivers → Networking support → Networking options → QoS and/or fair queueing." Relevant fields in the net_device structure include:

struct net_device *next_sched

Used by one of the software interrupts described in Chapter 11.

struct Qdisc *qdisc

struct Qdisc *qdisc_sleeping

struct Qdisc *qdisc_ingress

struct list_head qdisc_list

These fields are used to manage the ingress and egress packet queues and access to the device from different CPUs.

spinlock_t queue_lock

spinlock_t ingress_lock

The Traffic Control subsystem defines a private egress queue for each network device. queue_lock is used to avoid simultaneous accesses to it (see Chapter 11). ingress_lock does the same for ingress traffic.

unsigned long tx_queue_len

The length of the device's transmission queue. When Traffic Control support is present in the kernel, tx_queue_len may not be used (only a few queuing disciplines use it). Table 2-2 shows the values used for the most common device types. Its value can be tuned with the sysfs filesystem (see the /sys/class/net/device_name/ directories).

Table 2-2. tx_queue_len values for different device types

    Device type                                                  tx_queue_len
    -----------------------------------------------------------  ------------
    Ethernet                                                     1,000
    Token Ring                                                   100
    EtherChannel                                                 100
    Fibre Channel                                                100
    FDDI                                                         100
    TEQL (true link equalizer)[a]                                100
    ISDN                                                         30
    HIPPI                                                        25
    PLIP                                                         10
    SLIP                                                         10
    AX25                                                         10
    EQL (Equalizer load balancer for serial network interfaces)  5
    Generic PPP                                                  3
    Bonding                                                      0
    Loopback                                                     0
    Bridge                                                       0
    VLAN                                                         0

    [a] TEQL is one of the queuing disciplines you can configure with Traffic Control (the QoS layer).

Depending on the queuing discipline—the strategy used to queue incoming and outgoing packets—in use, tx_queue_len may or may not be used. It is usually used when the queue type is FIFO (First In, First Out) or something else relatively simple.

Note that all devices with a queue length of 0 are virtual devices: they rely on the associated real devices to do any queuing (with the exception of the loopback device, which does not need it because it is internal to the kernel and delivers all traffic immediately).

Feature Specific

As we saw when describing sk_buff, a few parameters are included in the definition of net_device only if the features they belong to have been included in the kernel:[1]

struct divert_blk *divert

Diverter is a feature that allows you to change the source and destination addresses of the incoming packet. This makes it possible to reroute traffic with specific characteristics specified by the configuration to a different interface or a different host. To work properly and to make sense, diverter needs other features such as bridging. The data structure pointed to by this field stores the parameters needed by the diverter feature. The associated kernel option is "Device drivers → Networking support → Networking options → Frame Diverter."

struct net_bridge_port *br_port

Extra information needed when the device is configured as a bridged port. The bridging code and the Spanning Tree Protocol (STP) are covered in Part IV. The associated kernel option is "Device drivers → Networking support → Networking options → 802.1d Ethernet Bridging."

void (*vlan_rx_register)(...)

void (*vlan_rx_add_vid)(...)

void (*vlan_rx_kill_vid)(...)

These three function pointers are used by the VLAN code to register a device as VLAN tagging capable (see net/8021q/vlan.c), add a VLAN to the device, and delete the VLAN from the device, respectively. The associated kernel option is "Device drivers → Networking support → Networking options → 802.1Q VLAN Support."

int netpoll_rx

void (*poll_controller)(...)

Used by the optional Netpoll feature that is briefly mentioned in Chapter 10.

Generic

In addition to the list management fields of the net_device structure discussed earlier, a few other fields are used to manage structures and make sure they are removed when they are not needed:

atomic_t refcnt

Reference count. The device cannot be unregistered until this counter has gone to zero (see Chapter 8).

int watchdog_timeo

struct timer_list watchdog_timer

Along with the tx_timeout variable discussed earlier, these fields implement the timer discussed in the section "Watchdog timer" in Chapter 11.

int (*poll)(...)

struct list_head poll_list

int quota

int weight

Used by the NAPI feature described in Chapter 10.

const struct iw_handler_def *wireless_handlers

struct iw_public_data *wireless_data

Additional parameters and function pointers used by wireless devices. See also get_wireless_stats.

struct list_head todo_list

The registration and unregistration of a network device is done in two steps. todo_list is used to handle the second one. See Chapter 8.

struct class_device class_dev

Used by the new generic kernel driver infrastructure.

Function Pointers

We saw in Chapter 1 that the networking code makes heavy use of function pointers. The net_device data structure includes quite a few of them. Such functions are used mainly to:

  • Transmit and receive a frame

  • Add or parse the link layer header on a buffer

  • Change a part of the configuration

  • Retrieve statistics

  • Interact with a specific feature

A few function pointers were already introduced in the previous sections when describing the fields used to accomplish a specific task. Here are the generic ones:

struct ethtool_ops *ethtool_ops

Pointer to a set of function pointers used to set or get the configuration of different device parameters. See the section "Ethtool" in Chapter 8.

int (*init)(...)

void (*uninit)(...)

void (*destructor)(...)

int (*open)(...)

int (*stop)(...)

Used to initialize, clean up, destroy, enable, and disable a device. Not all of them are always used. See Chapter 8.

struct net_device_stats* (*get_stats)(...)

struct iw_statistics* (*get_wireless_stats)(...)

Some statistics collected by the device driver can be displayed with user-space applications such as ifconfig and ip, and others are strictly used by the kernel and are discussed in the section "Device Status" earlier in this chapter. These two methods are used to collect statistics. get_stats operates on a normal device and get_wireless_stats on a wireless device. See also the earlier section "Statistics."

int (*hard_start_xmit)(...)

Used to transmit a frame. See Chapter 11.

int (*hard_header)(...)

int (*rebuild_header)(...)

int (*hard_header_cache)(...)

void (*header_cache_update)(...)

int (*hard_header_parse)(...)

int (*neigh_setup)(...)

Used by the neighboring layer. See the sections "Methods Provided by the Device Driver" and "Neighbor Initialization" in Chapter 27.

int (*do_ioctl)(...)

ioctl is the system call used to issue commands to devices (see Chapter 3). This method is called to process some of the ioctl commands (see Chapter 8).

void (*set_multicast_list)(...)

We have already seen in the section "Link Layer Multicast" that mc_list and mc_count are used to manage the list of L2 multicast addresses. This method is used to ask the device driver to configure the device to listen to those addresses. Usually it is not called directly, but through wrappers such as dev_mc_upload or its lockless version, __dev_mc_upload. When a device cannot install a list of multicast addresses, it simply enables all of them.

int (*set_mac_address)(...)

Changes the device MAC address. When the device does not provide this capability (as in the case of Bridge virtual devices), it is set to NULL.

int (*set_config)(...)

Configures driver parameters, such as the hardware parameters irq, io_addr, and if_port. Higher-layer parameters (such as protocol addresses) are handled by do_ioctl. Not many devices use this method, especially among the new devices that are better able to implement probe functions. A good example with some documentation can be found in sis900_set_config in drivers/net/sis900.c.

int (*change_mtu)(...)

Changes the device MTU (see the description of mtu in the earlier section, "Configuration"). Changing this field has no effect on the device driver but simply forces the kernel software to respect the new MTU and to handle fragmentation accordingly.

void (*tx_timeout)(...)

The method invoked at the expiration of the watchdog timer, which determines whether a transmission is taking a suspiciously long time to complete. The watchdog timer is not even started unless this method is defined. See the section "Watchdog timer" in Chapter 11 for more information.

int (*accept_fastpath)(...)

Fast switching (also called FASTROUTE) was a kernel feature that allowed device drivers to route incoming traffic during interrupt context using a small cache (bypassing all the software layers). Fast switching is no longer supported, starting with the 2.6.8 kernel. This method was used to test whether the fast-switching feature could be used on the device.

Files Mentioned in This Chapter

Figure 2-11 shows the main files referenced in this chapter. The missing ones will be introduced in upcoming chapters.

Figure 2-11. Files referenced in this chapter




[1] skb_reserve is also used by device drivers to align the IP header of ingress frames. See Chapter 10.

[1] See Chapter 21 for a discussion of fragmented buffers.

[1] See the section "dev_queue_xmit Function" in Chapter 11 for an example of its use.

[1] Bonding, also called EtherChannel (Cisco terminology) and trunking (Sun terminology), allows a set of interfaces to be grouped together and be treated as a single interface. This feature is useful when a system needs to support point-to-point connections at a high bandwidth. A nearly linear speedup can be achieved, with the virtual interface having a throughput nearly equal to the sum of the throughputs of the individual interfaces.

[] VLAN stands for Virtual LAN. The use of VLANs is a convenient way to isolate traffic using the same L2 switch in different broadcast domains by means of an additional tag, called the VLAN tag, that is added to the Ethernet frames. You can find an introduction to VLANs and their use with Linux at http://www.linuxjournal.com/article/7268.

[1] The fields are actually included only when the associated feature is part of the kernel. See, for example, br_port.

Chapter 3. User-Space-to-Kernel Interface

In this chapter, I'll briefly introduce the main mechanisms that user-space applications can use to communicate with the kernel or read information exported by it. We will not look at the details of their implementations, because each mechanism would deserve a chapter of its own. The purpose of this chapter is to give you enough pointers to the code and to external documentation so that you can further investigate the topic if interested. For example, with this chapter, you have the information you need to find how and where a given directory is added to /proc, which kernel handler processes a given ioctl command, and what functions are provided by Netlink, currently the preferred interface for user-space network configuration.

This chapter focuses only on the mechanisms that I will often mention in the book when talking about the interface between the user-space configuration commands such as ifconfig and route and the kernel handlers that apply the requested configurations. For an analysis of the generic messaging systems available for intrakernel communication as well as kernel-to-user-space communication, please refer to Understanding the Linux Kernel (O'Reilly).

The discussion of each feature in this book ends with a set of sections that show how user-space configuration tools and the kernel communicate. The information in this chapter can help you understand those sections better.

Overview

The kernel exports internal information to user space via different interfaces. Besides the classic set of system calls the application programmer can use to ask for specific information, there are three special interfaces, two of which are virtual filesystems:

procfs (/proc filesystem)

This is a virtual filesystem, usually mounted in /proc, that allows the kernel to export internal information to user space in the form of files. The files don't actually exist on disk, but they can be read through cat or more and written to with the > shell redirector; they even can be assigned permission like real files. The components of the kernel that create these files can therefore say who can read from or write to any file. Directories cannot be written (i.e., no user can add or remove a file or a directory to or from any directory in /proc).

The default kernel that comes with most (if not all) Linux distributions includes support for procfs. It cannot be compiled as a module. The associated kernel option from the configuration menu is "Filesystems → Pseudo filesystems → /proc file system support."

sysctl (/proc/sys directory)

This interface allows user space to read and modify the value of kernel variables. You cannot use it for every kernel variable: the kernel has to explicitly say what variables are visible through this interface. From user space, you can access the variables exported by sysctl in two ways. One is the sysctl system call (see man sysctl) and the other one is procfs. When the kernel has support for procfs, it adds a special directory (/proc/sys) to /proc that includes a file for each kernel variable exported by sysctl.

The sysctl command that comes with the procps package can be used to configure variables exported with the sysctl interface. The command talks to the kernel by writing to /proc/sys.

The default kernel that comes with most (if not all) Linux distributions includes support for sysctl. It cannot be compiled as a module. The associated kernel option from the configuration menu is "General setup → Sysctl support."

sysfs(/sys filesystem)

procfs and sysctl have been abused over the years, and this has led to the introduction of a newer filesystem: sysfs. sysfs exports plenty of information in a very clean and organized way. You can expect part of the information currently exported with sysctl to migrate to sysfs.

sysfs is available only with kernels starting at 2.6. The default kernel that comes with most (if not all) Linux distributions includes support for sysfs. It cannot be compiled as a module. The associated kernel option from the configuration menu is "Filesystems → Pseudo filesystems → sysfs filesystem support (NEW)." The option is visible only if you first enable the following option: "General setup → Configure standard kernel features (for small systems)."

You can find a detailed analysis of sysfs in the latest edition of the O'Reilly book Linux Device Drivers. In Chapter 17, we will see how the bridging code uses it.

You can also use the following interfaces to send commands to the kernel, either to configure something or to dump the configuration of something else:

ioctl system call

The ioctl (input/output control) system call operates on a file and is usually used to implement operations needed by special devices that are not provided by the standard filesystem calls. ioctl can be passed a socket descriptor too, as returned by the socket system call, and that is how it is used by the networking code. This interface is used by old-generation commands like ifconfig and route, among others.

Netlink socket

This is the newest and preferred mechanism for networking applications to communicate with the kernel. Most commands in the IPROUTE2 package use it. Netlink represents for Linux what the routing socket represents in the BSD world.

Most network kernel features can be configured using either Netlink or ioctl interfaces, because the kernel supports both the newer configuration tools (IPROUTE2) and the legacy ones (ifconfig, route, etc.).

procfs Versus sysctl

Both procfs and sysctl export kernel-internal information, but procfs mainly exports read-only data, while most sysctl information is writable too (but only by the superuser).

As far as exporting read-only data, the choice between procfs and sysctl depends on how much information is supposed to be exported. Files associated with a simple kernel variable or data structure are exported with sysctl. The others, which are associated with more complex data structures and may need special formatting, are exported with procfs. Examples of the latter category are caches and statistics.

procfs

Most networking features register one or more files in /proc when they get initialized, either at boot time or at module load time. When a user reads the file, it causes the kernel to indirectly run a set of kernel functions that return some kind of output. The files registered by the networking code are located in /proc/net.

Directories in /proc can be created with proc_mkdir. Files in /proc/net can be registered and unregistered with proc_net_fops_create and proc_net_remove, defined in include/linux/proc_fs.h. These two routines are wrappers around the generic APIs create_proc_entry and remove_proc_entry. In particular, proc_net_fops_create takes care of creating the file (with proc_net_create) and initializing its file operation handlers. Let's look at an example.

This is how the ARP protocol registers its arp file in /proc/net:

static struct file_operations arp_seq_fops = {
    .owner      = THIS_MODULE,
    .open       = arp_seq_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = seq_release_private,
};

static int __init arp_proc_init(void)
{
    if (!proc_net_fops_create("arp", S_IRUGO, &arp_seq_fops))
        return -ENOMEM;
    return 0;
}

The three input parameters to proc_net_fops_create tell you that the filename is arp, it must be assigned read permission only, and the set of file operation handlers is arp_seq_fops. When a user reads the file, the use of the file_operations data structure allows procfs to return data to the user in chunks. This is useful when the data consists of a collection of objects of the same type. For example, the ARP cache is returned one entry at a time, the routing table is returned one route at a time, etc.

The routine to which open is initialized (arp_seq_open in the previous example) makes another important initialization: it registers an array of function pointers that includes all the routines procfs uses to walk through the data that is to be returned to the user: one routine to start the dump, another to advance one item, and another one to dump one item. Those routines internally take care of saving the necessary context information (in this example, how much of the ARP cache has been dumped already) needed to remember what point the dump is at and to resume it from the right position.

static struct seq_operations arp_seq_ops = {
    .start   = arp_seq_start,
    .next    = neigh_seq_next,
    .stop    = neigh_seq_stop,
    .show    = arp_seq_show,
};

static int arp_seq_open(struct inode *inode, struct file *file)
{
    ...
    rc = seq_open(file, &arp_seq_ops);
    ...
}

sysctl: Directory /proc/sys

What the user sees as a file somewhere under /proc/sys is actually a kernel variable. For each variable, the kernel can define:

  • Where to place it in /proc/sys. Variables associated with the same kernel component or feature are usually located within a common directory. For instance, in /proc/sys/net/ipv4 you can find IPv4-related files.

  • What name to give it. Most of the time, the files are simply given the same name as the associated kernel variables, but sometimes their name is changed to be a little more user friendly.

  • The permission. A file may, for instance, be readable by anyone but modified only by the superuser.

/proc/sys中导出的变量内容可以通过访问关联文件(前提是您具有必要的权限)来读取或写入,或者更直接地使用sysctl系统调用。

The content of the variables exported in /proc/sys can be read or written by accessing the associated file (provided that you have the necessary permissions), or more directly with the sysctl system call.

Some directories and files are defined statically at boot time; others are added at runtime. Examples of events that lead to the runtime creation of directories or files are:

  • When a kernel module implements a new feature or a protocol is loaded or unloaded.

  • When a new network device is registered or unregistered. There are configuration parameters (and thus files in /proc/sys) that have one instance per device. For example, the directories /proc/sys/net/ipv4/conf (discussed in Chapter 36) and /proc/sys/net/ipv4/neigh (discussed in Chapter 29) have one subdirectory for each registered network device.

Both files and directories in /proc/sys are defined with ctl_table structures. ctl_table structures are registered and unregistered with the register_sysctl_table and unregister_sysctl_table functions, defined in kernel/sysctl.c.

Here are the key fields of ctl_table:

const char *procname

Filename that will be used in /proc/sys.

int maxlen

Size of the kernel variable that is exported.

mode_t mode

Permissions to be assigned to the associated file or directory in /proc/sys.

ctl_table *child

Used to build the parent-child relationships between directories and files. We will see examples later in this section.

proc_handler

Function that performs the read or write operation when you read from or write to a file in /proc/sys. All ctl_table instances associated with files (i.e., the leaves of the tree) must have proc_handler initialized. Directories are assigned a default one by the kernel.

strategy

Function that can optionally be initialized to a routine that performs additional formatting of data before displaying or storing it. It is invoked when the file in /proc/sys is accessed with the sysctl system call.

extra1

extra2

Two optional parameters commonly used to define the minimum and maximum values for the variable. I'll often refer to these two parameters as the min/max parameters.

Depending on what kind of variable is associated with a file, proc_handler and strategy are initialized differently. For example, proc_dointvec is the proc_handler routine to use when the kernel variable consists of one or more integer values. Tables 3-1 and 3-2 list some of the routines that can be used to initialize proc_handler and strategy, respectively. All routines are defined and well commented in kernel/sysctl.c.

Table 3-1. Routines for initializing proc_handler

Function

Description

proc_dostring

Reads/writes a string.

proc_dointvec

Reads/writes an array of one or more integers.

proc_dointvec_minmax

Similar to proc_dointvec, but makes sure the input falls within a min/max range. Values that do not respect the range are rejected.

proc_dointvec_jiffies

Reads/writes an array of integers. The kernel variable is expressed in jiffies but is converted into seconds before being returned to the user, and vice versa.

proc_dointvec_ms_jiffies

Reads/writes an array of integers. The kernel variable is expressed in jiffies but is converted into milliseconds before being returned to the user, and vice versa.

proc_doulongvec_minmax

Similar to proc_dointvec_minmax, but the values are longs rather than integers.

proc_doulongvec_ms_jiffies_minmax

Reads/writes an array of longs. The kernel variable is expressed in jiffies but is converted into milliseconds before being returned to the user, and vice versa. The kernel variable must be assigned values within a min/max range.

Table 3-2. Routines for initializing strategy

Function

Description

sysctl_string

Reads/writes a string

sysctl_intvec

Reads/writes an array of integers and makes sure that they respect the min/max range

sysctl_jiffies

Reads/writes a value expressed in jiffies and converts it into seconds

sysctl_ms_jiffies

Reads/writes a value expressed in jiffies and converts it into milliseconds

It is not uncommon for a strategy or proc_handler function to be initialized to a routine that is a wrapper around one of the routines in Tables 3-1 or 3-2. The wrapper may be necessary to add some kind of logic or sanity check that depends on the meaning of the associated kernel variable. An example is in the next section.

Anytime we look at the procfs interface for the configuration of any of the features covered in this book, I will always refer to the proc_handler function for simplicity.

Examples of ctl_table initialization

Let's first see what the initialization of a ctl_table structure for a file and a directory looks like, and then how they are actually used.

This is the initialization of the ctl_table instance used for the /proc/sys/net/ipv4/conf/default/forwarding file, defined in net/ipv4/devinet.c. Its use is described in Chapter 36.

  {
      .ctl_name      = NET_IPV4_CONF_FORWARDING,
      .procname      = "forwarding",
      .data          = &ipv4_devconf.forwarding,
      .maxlen        = sizeof(int),
      .mode          = 0644,
      .proc_handler  = &devinet_sysctl_forward,
  }

From this snapshot, you can't really tell where in /proc/sys the file will be placed. We will see in a moment how you can find that information. What you can tell from the code is that the file is called forwarding, the kernel variable whose value is exported with the forwarding file is ipv4_devconf.forwarding (a field within a more complex structure), the parameter is declared as an integer, the permissions on the file are 0644 (i.e., read permission for anyone, write permission for the superuser only), and the proc_handler routine is initialized to devinet_sysctl_forward.

Now let's see an example of a declaration of a directory from kernel/sysctl.c:

  {
      .ctl_name    = CTL_NET,
      .procname    = "net",
      .mode        = 0555,
      .child       = net_table,
  }

This is the ctl_table instance that defines the directory /proc/sys/net. No proc_handler is needed this time (the kernel provides a default one that suits the needs of all directories), but there is a child field instead. child is a pointer to another ctl_table instance, which is nothing but the head element of a list of ctl_table instances (there will be one instance for each file or subdirectory created within the net directory).

Registering a file in /proc/sys

We saw that a file can be registered to and unregistered from /proc/sys with register_sysctl_table and unregister_sysctl_table, respectively. The registration function, well documented in the source code, requires two input parameters:

  • A pointer to a ctl_table instance

  • A flag that tells where to put the new element in the list of ctl_table instances located in the same directory: at the head (1) or at the tail (0)

Note that the input to register_sysctl_table does not include a reference to the location in the /proc/sys filesystem where the input ctl_table is added. The reason is that all insertions are made into the /proc/sys directory. If you wanted to register a file into a subdirectory of /proc/sys, you would need to provide the full path by building a tree (by means of multiple ctl_table instances linked with the child field) and pass to register_sysctl_table the ctl_table instance that represents the root of the tree you have built. When any of the nodes of the tree do not exist already, they are created.

Let's take two examples, starting with a simpler one. This piece of code from drivers/scsi/scsi_sysctl.c shows how the file logging_level is defined and placed in the /proc/sys/dev/scsi/ directory:

static ctl_table scsi_table[] = {
    { .ctl_name     = DEV_SCSI_LOGGING_LEVEL,
      .procname     = "logging_level",
      .data         = &scsi_logging_level,
      .maxlen       = sizeof(scsi_logging_level),
      .mode         = 0644,
      .proc_handler = &proc_dointvec },
    { }
};

static ctl_table scsi_dir_table[] = {
    { .ctl_name    = DEV_SCSI,
      .procname    = "scsi",
      .mode        = 0555,
      .child       = scsi_table },
    { }
};

static ctl_table scsi_root_table[] = {
    { .ctl_name    = CTL_DEV,
      .procname    = "dev",
      .mode        = 0555,
      .child       = scsi_dir_table },
    { }
};

int __init scsi_init_sysctl(void)
{
    scsi_table_header = register_sysctl_table(scsi_root_table, 1);
}

Note that register_sysctl_table is passed scsi_root_table, which is the root of the ctl_table tree defined in the code. The result is shown in Figure 3-1.

Figure 3-1. Registration of the /proc/sys/dev/scsi/logging_level file

Note also that if later you wanted to add another file to the same directory—say, abc—you would need to define a similar tree (i.e., the same two ctl_table instances for the dev and scsi directories, plus one new ctl_table instance for the new file abc).

What developers sometimes do to simplify the addition of new files to an already existing directory is to define a template and reuse it any time a new file is to be added to the same directory. The good part about using templates is that the ctl_table instances that are used to navigate the directories (e.g., scsi_root_table and scsi_dir_table in the previous example) need to be initialized only once: after that, every time you add a new file you will initialize only the leaf nodes (i.e., the real files). See, for example, how the neighboring subsystem defines neigh_sysctl_template and uses it with neigh_sysctl_register in net/core/neighbour.c (see also Chapter 29).

Core networking files and directories

Figure 3-2 shows the main directories used by the networking code in /proc/sys. For each one, it tells you in what chapter its files are described.

Figure 3-2. Core directories in /proc/sys/net

Let's see, based on what we saw in the previous section, how the tree rooted in net is defined and registered at boot time.

For each directory in Figure 3-2, and for each file in those directories, there is an instance of ctl_table. Figure 3-3 shows where the ctl_table instances of most of the directories in Figure 3-2 are defined, and what the child-parent relationships are. Not all directories have been included to make the figure more readable.

The three boxes in Figure 3-3 show three examples of ctl_table initializations. Note that:

  • The netdev_max_backlog file is assigned a proc_handler routine but not a strategy routine. Because netdev_max_backlog is an integer, the input from the user is read with proc_dointvec.

  • The min_delay file is assigned both the proc_handler and strategy routines. Because the kernel variable ip_rt_min_delay is expressed in jiffies but the user input and output are in seconds, the two routines take care of converting seconds to jiffies.

  • The ip_local_port_range file is an interesting case. It is used to allow the user to configure a range, defined as two values. The range must respect a minimum and a maximum value. Therefore, the strategy and proc_handler routines selected are able to manage an array of integer values (two of them in this case). These values, extra1 and extra2, express the range and are used to make sure that the input from the user respects it.

ioctl

At the top of Figure 3-4, you can see how an ioctl call is issued. Let's see an example involving ifconfig.

We said earlier that the ifconfig command uses ioctl to communicate with the kernel. For example, when the system administrator types a command like ifconfig eth0 mtu 1250 to change the MTU of the interface eth0, ifconfig opens a socket, initializes a local data structure with the information received from the system administrator (data in the example), and passes it to the kernel with an ioctl call. SIOCSIFMTU is the command identifier.

    struct ifreq data;
    fd = socket(PF_INET, SOCK_DGRAM, 0);
    < ... initialize "data" ...>
    err = ioctl(fd, SIOCSIFMTU, &data);

ioctl commands are processed by the kernel in different places. Figure 3-4 shows how the most common ioctl commands used by the networking code are dispatched by sock_ioctl and routed to the right function handler. We will not see how sock_ioctl is invoked or how transport protocols like UDP and TCP register their handlers. If you desire to dig into this part of the code, you can use the figure as a starting point. For the routines that we cover in this book, the figure provides a reference to the right chapter.

Figure 3-3. Creation of the core directories in /proc/sys/net

The name of the ioctl commands in the figure is parsed (split into components) for your convenience. For example, the command used to add a route to a routing table, SIOCADDRT, is shown as SIOC ADD RT to emphasize the two interesting components: ADD, which says you are adding something, and RT, which says a route is what you are adding. Most commands follow this syntax. Often, when a given object type can be both read and written, you have one more component in the command name: G for get or S for set. The two commands that get and set the IP address of an interface, SIOCGIFADDR and SIOCSIFADDR, are an example. SIOCSIFMTU, which we saw in the earlier ifconfig example, sets (S) the interface's (IF) maximum transmission unit (MTU). SIOCSIFMTU, which is taken care of by dev_ioctl, does not appear in Figure 3-4.

Figure 3-4. Dispatching ioctl commands

Networking ioctl commands are listed in include/linux/sockios.h. Device drivers can define new (private) commands with codes in the range SIOCDEVPRIVATE through SIOCDEVPRIVATE+15. See, for example, how the four private commands used with (virtual) tunnel devices are defined in include/linux/if_tunnel.h. The use of private ioctl commands is deprecated and discouraged, however.

Protocols can also define private commands in the range SIOCPROTOPRIVATE through SIOCPROTOPRIVATE+15.

Netlink

The Netlink socket, well described in RFC 3549, represents the preferred interface between user space and kernel for IP networking configuration. Netlink can also be used as an intrakernel messaging system as well as between multiple user-space processes.

With Netlink sockets you can use the standard socket APIs to open, close, transmit on, and receive from a socket. Let's quickly review the prototype of the socket system call:

int socket(int domain, int type, int protocol)

For details on what the three arguments are initialized to with TCP/IP sockets (i.e., domain PF_INET), you can use the man socket command.

As with any other socket, when you open a Netlink socket, you need to provide the domain, type, and protocol arguments. Netlink uses the new PF_NETLINK protocol family (domain), supports only the SOCK_DGRAM type, and defines several protocols, each one used for a different component (or a set of components) of the networking stack. For example, the NETLINK_ROUTE protocol is used for most networking features, such as routing and neighboring protocols, and NETLINK_FIREWALL is used for the firewall (Netfilter). The Netlink protocols are listed in the NETLINK_XXX enumeration list in include/linux/netlink.h.

With Netlink sockets, endpoints are usually identified by the ID of the process that opened the sockets (PID), where the special value 0 identifies the kernel. Among Netlink's features is the ability to send both unicast and multicast messages: the destination endpoint address can be a PID, a multicast group ID, or a combination of the two. The kernel defines Netlink multicast groups for the purpose of sending out notifications about particular kinds of events, and user programs can register to those groups if they are interested in them. The groups are listed in the enumeration list RTMGRP_XXX in include/linux/rtnetlink.h. Among them are the RTMGRP_IPV4_ROUTE and RTMGRP_NEIGH groups, used respectively for notifications regarding changes to the routing tables and to the L3-to-L2 address mappings. We will see how these two groups are used in Parts VI and VII.

Another interesting feature is the ability to send both positive and negative acknowledgments.

One of the advantages of Netlink over other user-kernel interfaces such as ioctl is that the kernel can initiate a transmission instead of just returning information in answer to user-space requests.

Serializing Configuration Changes

Any time you apply a configuration change, the handler that takes care of it inside the kernel acquires a semaphore (rtnl_sem) that ensures exclusive access to the data structures that store the networking configuration. This is true regardless of whether the configuration is applied via ioctl or Netlink.

Part II. System Initialization

In this part of the book, we will see how and when network devices are initialized and registered with the kernel. I'll put special emphasis on Peripheral Component Interconnect (PCI) devices, both because they are increasingly common and because they have special requirements.

Many tasks related to the network interface card (NIC) have to be accomplished before getting a network up and running. First, key kernel components need to be initialized. Then device drivers must initialize and register all the devices they are responsible for and allocate the resources the kernel will use to communicate with them (IRQ, I/O ports, etc.).

It's important to distinguish between two kinds of registration. First, when a device is discovered, it is registered with the kernel as a generic device. Second, an NIC device is registered with the network stack as a network device. For example, a PCI Ethernet card is registered both as a generic PCI device with the PCI layer, and as an Ethernet card (where the device gets a name such as eth0) with the network stack. The first kind of registration is covered in Chapter 6 and the second in Chapter 8.

Here is what is covered in each chapter:

Chapter 4 Notification Chains

The mechanism that kernel components use to notify each other about specific events.

Chapter 5 Network Device Initialization

How network devices are initialized.

Chapter 6 The PCI Layer and Network Interface Cards

How PCI device drivers register with the kernel, and how PCI devices are identified and associated with their drivers.

Chapter 7 Kernel Infrastructure for Component Initialization

The kernel mechanism that ensures that the necessary initialization functions are invoked at boot time or module load time. We'll learn how initialization routines can be tagged with special macros to optimize memory usage and therefore reduce the size of the kernel image. We will also see how the kernel can be passed boot options and how these can be used to configure NICs.

Chapter 8 Device Registration and Initialization

How devices are registered with the kernel and initialized.

Chapter 4. Notification Chains

The kernel's many subsystems are heavily interdependent, so an event detected or generated by one of them could be of interest to others. To fulfill the need for interaction, Linux uses so-called notification chains .

In this chapter, we will see:

  • How notification chains are declared and what chains are defined by the networking code

  • How a kernel subsystem can register to a notification chain

  • How a kernel subsystem generates a notification on a chain

Note that notification chains are used only between kernel subsystems. Notifications between kernel and user space rely on other mechanisms, such as those introduced in Chapter 3.

Reasons for Notification Chains

Suppose we had the Linux router in Figure 4-1 with four interfaces. The figure shows the relationship between the router and five networks, along with a simplified version of its routing table.

Let's look at some examples of the topology in Figure 4-1. Network A is directly connected to RT on interface eth0, and network F is not directly connected to RT, but RT's eth3 is directly connected to another router that has an interface with address IP1, and that second router knows how to reach network F. The other cases are similar. In short, some networks are directly connected and others require the help of one or more additional routers to be reached.

For a detailed description of how the routing code handles this situation, refer to Part VII. In this chapter, we will concentrate on the role of notification chains. Suppose that interface eth3 went down, due to a break in the network, an administrative command (such as ifconfig eth3 down) or a hardware failure. Networks D, E, and F would become unreachable by RT (and by systems in A, B, and C relying on RT for their connections) and should be removed from the routing table. Who is going to tell the routing subsystem about that interface failure? A notification chain.

Figure 4-1. Example of Linux router

Figure 4-2 shows a slightly more complicated example where the routing subsystem interacts with dynamic routing protocols—protocols that can adjust the routing table or tables[*] to the network topology and therefore cope with interface failures when the topology allows it (i.e., when there are redundant paths).

Figure 4-2. Example of a Linux router with dynamic routing protocols

In Figure 4-2, network F could be reached by RT by passing through both network A and network E. E was chosen initially because of its smaller cost,[†] but now that E is no longer reachable, the routing table should update the route for network F to go through network A. The basis for such a decision could include local host events, such as device registration and unregistration, as well as complex factors in router configuration and the routing protocols used. In any case, the routing subsystem that manages the tables must be informed of the relevant information by some other subsystem, demonstrating the need for notification chains.

Overview

A notification chain is simply a list of functions to execute when a given event occurs. Each function lets one other subsystem know about an event that occurred within, or was detected by, the subsystem calling the function.

Thus, for each notification chain there is a passive side (the notified) and an active side (the notifier), as in the so-called publish-and-subscribe model:

  • The notified are the subsystems that ask to be notified about the event and that provide a callback function to invoke.

  • The notifier is the subsystem that experiences an event and calls the callback function.

The functions executed are chosen by the notified subsystems. It is never up to the owner of the chain (the subsystem that generates the notifications) to decide what functions to execute. The owner simply defines the list; any kernel subsystem can register a callback function with that chain to receive the notification.

The use of notification chains makes the source code easier to write and maintain. Imagine how a generic routine might notify external subsystems about an event without using notification chains:

if (subsystem_X_enabled) {
    do_something_1
}
if (subsystem_Y_enabled) {
    do_something_2
}
if (subsystem_Z_enabled) {
    do_something_3
}
... ... ...

In other words, a conditional clause would have to be included for every possible subsystem that might be interested in an event, and the maintainer of this subsystem would have to add a new clause every time somebody else added a subsystem to the kernel.

No subsystem maintainer is expected to keep track of every subsystem added to the kernel. However, each subsystem maintainer should know:

  • The kinds of events from other subsystems he is interested in

  • The kinds of events he knows about and that other subsystems may be interested in

Thus, notification chains allow each subsystem to share the occurrence of an event with others, without having to know what the others are and why they are interested.

Defining a Chain

The elements of the notification chain's list are of type notifier_block, whose definition is the following:

struct notifier_block
{
    int (*notifier_call)(struct notifier_block *self, unsigned long, void *);
    struct notifier_block *next;
    int priority;
};

notifier_call is the function to execute, next is used to link together the elements of the list, and priority represents the priority of the function. Functions with higher priority are executed first. But in practice, almost all registrations leave the priority out of the notifier_block definition, which means it gets the default value of 0 and execution order ends up depending only on the registration order (i.e., it is a semirandom order). The return values of notifier_call are listed in the upcoming section, "Notifying Events on a Chain."

Common names for notifier_block instances are xxx_chain, xxx_notifier_chain, and xxx_notifier_list.

Registering with a Chain

When a kernel component is interested in the events of a given notification chain, it can register it with the general function notifier_chain_register. The kernel also provides a set of wrappers around notifier_chain_register, some of which are shown in Table 4-1.

Table 4-1 lists the main APIs and the associated wrappers used to register and unregister to the three chains inetaddr_chain , inet6addr_chain , and netdev_chain.

Table 4-1. Main functions and wrappers for a few chains

Operation        Function prototype

Registration     int notifier_chain_register(struct notifier_block **list, struct notifier_block *n)

  Wrappers:
    inetaddr_chain     register_inetaddr_notifier
    inet6addr_chain    register_inet6addr_notifier
    netdev_chain       register_netdevice_notifier

Unregistration   int notifier_chain_unregister(struct notifier_block **nl, struct notifier_block *n)

  Wrappers:
    inetaddr_chain     unregister_inetaddr_notifier
    inet6addr_chain    unregister_inet6addr_notifier
    netdev_chain       unregister_netdevice_notifier

Notification     int notifier_call_chain(struct notifier_block **n, unsigned long val, void *v)

For each chain, the notifier_block instances are inserted into a list, which is sorted by priority. Elements with the same priority are sorted based on insertion time: new ones go to the tail.

Accesses to the notification chains are protected by the notifier_lock lock. The use of a single lock for all the notification chains is not a big constraint and does not affect performance, because subsystems usually register their notifier_call functions only at boot time or at module load time, and from that moment on access the lists in a read-only manner (that is, shared).

Because the notifier_chain_register function is called to insert callbacks into all lists, it requires that the list be specified as an input parameter. However, this function is rarely called directly; generic wrappers are used instead.

int notifier_chain_register(struct notifier_block **list, struct notifier_block *n)
{
    write_lock(&notifier_lock);
    while(*list)
    {
        if(n->priority > (*list)->priority)
            break;
        list= &((*list)->next);
    }
    n->next = *list;
    *list=n;
    write_unlock(&notifier_lock);
    return 0;
}

Notifying Events on a Chain

Notifications are generated with notifier_call_chain, defined in kernel/sys.c. This function simply invokes, in order of priority, all the callback routines registered against the chain. Note that callback routines are executed in the context of the process that calls notifier_call_chain. A callback routine could, however, be implemented so that it queues the notification somewhere and wakes up a process that will look at it.

int notifier_call_chain(struct notifier_block **n, unsigned long val, void *v)
{
    int ret = NOTIFY_DONE;
    struct notifier_block *nb = *n;
 
    while (nb)
    {
        ret = nb->notifier_call(nb, val, v);
        if (ret & NOTIFY_STOP_MASK)
        {
            return ret;
        }
        nb = nb->next;
    }
    return ret;
}

This is the meaning of its three input parameters:

n

Notification chain.

val

Event type. The chain itself identifies a class of events; val unequivocally identifies an event type (i.e., NETDEV_REGISTER).

v

Input parameter that can be used by the handlers registered by the various clients. This can be used in different ways under different circumstances. For instance, when a new network device is registered with the kernel, the associated notification uses v to identify the net_device data structure.

The callback routines called by notifier_call_chain can return any of the NOTIFY_XXX values defined in include/linux/notifier.h:

NOTIFY_OK

Notification was processed correctly.

NOTIFY_DONE

Not interested in the notification.[*]

NOTIFY_BAD

Something went wrong. Stop calling the callback routines for this event.

NOTIFY_STOP

Routine invoked correctly. However, no further callbacks need to be called for this event.

NOTIFY_STOP_MASK

This flag is checked by notifier_call_chain to see whether to stop invoking the callback routines, or keep going. Both NOTIFY_BAD and NOTIFY_STOP include this flag in their definitions.

notifier_call_chain captures and returns the return value received by the last callback routine invoked. This is true regardless of whether all the callbacks have been invoked, or one of them interrupted the loop due to a return value of NOTIFY_BAD or NOTIFY_STOP.

Note that it is possible for notifier_call_chain to be called for the same notification chain on different CPUs at the same time. It is the responsibility of the callback functions to take care of mutual exclusion and serialization where needed.

Notification Chains for the Networking Subsystems

The kernel defines at least 10 different notification chains. Here we are interested in the ones that are used to signal events of particular importance to the networking code. The main ones are:

inetaddr_chain

Sends notifications about the insertion, removal, and change of an Internet Protocol Version 4 (IPv4) address on a local interface. Chapter 23 describes when such notifications are generated. Internet Protocol Version 6 (IPv6) uses a similar chain (inet6addr_chain ).

netdev_chain

Sends notifications about the registration status of network devices. Chapter 8 describes when such notifications are generated.

For these chains, and others used by the networking subsystems, their purposes and uses are described in the chapter about the relevant notifier subsystem.

The networking code can register to notifications generated by other kernel components, too. For example, some NIC device drivers register with the reboot_notifier_list chain, which is a chain that warns when the system is about to reboot.

Wrappers

Most notification chains come with a set of wrappers used to register to them and unregister from them. For example, this is the wrapper used to register to netdev_chain:

int register_netdevice_notifier(struct notifier_block *nb)
{
        return notifier_chain_register(&netdev_chain, nb);
}

Common names for wrappers include [un]register_xxx_notifier, xxx_[un]register_notifier, and xxx_[un]register.

Examples

Registrations to notification chains usually take place when the interested kernel component is initialized. For example, the following snapshot from net/ipv4/fib_frontend.c shows ip_fib_init, which is the initialization routine used by the routing code that is described in the section "Routing Subsystem Initialization" in Chapter 32:

static struct notifier_block fib_inetaddr_notifier = {
    .notifier_call = fib_inetaddr_event,
};
 
static struct notifier_block fib_netdev_notifier = {
    .notifier_call = fib_netdev_event,
};
 
void __init ip_fib_init(void)
{
    ... ... ...
    register_netdevice_notifier(&fib_netdev_notifier);
    register_inetaddr_notifier(&fib_inetaddr_notifier);
}

The routing code registers to both of the chains introduced in the earlier section, "Notification Chains for the Networking Subsystems." The routing tables are affected both by changes to locally configured IP addresses and by changes to the registration status of local devices.

Tuning via /proc Filesystem

There is no file of interest in /proc as far as this chapter is concerned.

Functions and Variables Featured in This Chapter

Table 4-2 summarizes the functions and data structures introduced in this chapter.

Table 4-2. Functions, macros, and data structures used for notification chains

Name                                   Description

Functions and macros

  notifier_chain_register + wrappers
  notifier_chain_unregister + wrappers
  notifier_call_chain                  The first two functions register and unregister a callback handler for a notification chain. The third sends out all the notifications about events in a specific class.

Data structure

  struct notifier_block                Defines the handler for a notification. It includes the callback function to invoke.

Files and Directories Featured in This Chapter

Figure 4-3 lists the files referred to in this chapter.

Figure 4-3. Files related to notification chains




[*] It is possible to have multiple routing tables at the same time. We will cover this feature in Chapter 31.

[†] The cost of a link is one of the metrics that routing protocols can use to compare links and choose among them. See Chapter 30.

[*] This return value is sometimes improperly used in place of NOTIFY_OK.

Chapter 5. Network Device Initialization

The flexibility of modern operating systems introduces complexity into initialization . First, a device driver can be loaded as either a module or a static component of the kernel. Furthermore, devices can be present at boot time or inserted (and removed) at runtime: the latter type of device, called a hot-pluggable device, includes USB, PCI CardBus, IEEE 1394 (also called FireWire by Apple), and others. We'll see how hot-plugging affects what happens in both the kernel and the user space.

In this first chapter, we will cover:

  • A piece of the core networking code initialization.

  • The initialization of an NIC.

  • How an NIC uses interrupts, and how IRQ handlers can be allocated and released. We will also look at how drivers can share IRQs.

  • How the user can provide configuration parameters to device drivers loaded as modules.

  • Interaction between user space and kernel during device initialization and configuration. We will look at how the kernel can run a user-space helper to either load the correct device driver for an NIC or apply a user-space configuration. In particular, we will look at the Hotplug feature.

  • How virtual devices differ from real ones with regard to configuration and interaction with the kernel.

System Initialization Overview

It's important to know where and how the main network-related subsystems are initialized, including device drivers. However, because this book is concerned only with the networking aspect of such initializations, I will not cover device drivers in general, or generic kernel services (e.g., memory management). For an understanding of that background, I recommend that you read Linux Device Drivers and Understanding the Linux Kernel, both published by O'Reilly.

Figure 5-1 shows briefly where, and in what sequence, some of the kernel subsystems are initialized at boot time (see init/main.c).

Figure 5-1. Kernel initialization

When the kernel boots up, it executes start_kernel, which initializes a bunch of subsystems, as partially shown in Figure 5-1. Before start_kernel terminates, it invokes the init kernel thread, which takes care of the rest of the initializations. Most of the initialization activities related to this chapter happen to be inside do_basic_setup.

Among the various initialization tasks, we are mainly interested in three:

Boot-time options

Two calls to parse_args, one direct and one indirect via parse_early_param, handle configuration parameters that a boot loader such as LILO or GRUB has passed to the kernel at boot time. We will see how this task is handled in the section "Boot-Time Kernel Options."

Interrupts and timers

Hardware and software interrupts are initialized with init_IRQ and softirq_init, respectively. Interrupts are covered in Chapter 9. In this chapter, we will see just how device drivers register a handler with an IRQ and how IRQ handlers are organized in memory. Timers are also initialized early in the boot process so that later tasks can use them.

Initialization routines

Kernel subsystems and built-in device drivers are initialized by do_initcalls. free_init_mem frees a piece of memory that holds unneeded code. This optimization is possible thanks to smart routine tagging. See Chapter 7 for more details.

run_init_process determines the first process run on the system, the parent of all other processes; it has a PID of 1 and never halts until the system is done. Normally the program run is init, part of the SysVinit package. However, the administrator can specify a different program through the init= boot time option. When no such option is provided, the kernel tries to execute the init command from a set of well-known locations, and panics if it cannot find any. The user can also provide boot-time options that will be passed to init (see the section "Boot-Time Kernel Options").

Device Registration and Initialization

For a network device to be usable, it must be recognized by the kernel and associated with the correct driver. The driver stores, in private data structures, all the information needed to drive the device and interact with other kernel components that require the device. The registration and initialization tasks are taken care of partially by the core kernel and partially by the device driver. Let's go over the initialization phases:

Hardware initialization

This is done by the device driver in cooperation with the generic bus layer (e.g., PCI or USB). The driver, sometimes alone and sometimes with the help of user-supplied parameters, configures such features of each device as the IRQ and I/O address so that they can interact with the kernel. Because this activity is closer to the device drivers than to the higher-layer protocols and features, we will not spend much time on it. We will see one example for the PCI layer.

Software initialization

Before the device can be used, depending on what network protocols are enabled and configured, the user may need to provide some other configuration parameters, such as IP addresses. This task is addressed in other chapters.

Feature initialization

The Linux kernel comes with lots of networking options. Because some of them need per-device configuration, the device initialization boot sequence must take care of them. One example is Traffic Control, the subsystem that implements Quality of Service (QoS) and that decides, therefore, how packets are queued on and dequeued from the device egress's queue (and with some limitations, also queued on and dequeued from the ingress's queue).

We already saw in Chapter 2 that the net_device data structure includes a set of function pointers that the kernel uses to interact with the device driver and special kernel features. The initialization of these functions depends in part on the type of device (e.g., Ethernet) and in part on the device's make and model. Given the popularity of Ethernet, this chapter focuses on the initialization of Ethernet devices (but other devices are handled very similarly).

Chapter 8 goes into more detail on how device drivers register their devices with the networking code.

Basic Goals of NIC Initialization

Each network device is represented in the Linux kernel by an instance of the net_device data structure. In Chapter 8, you will see how net_device data structures are allocated and how their fields are initialized, partly by the device driver and partly by core kernel routines. In this chapter, we focus on how device drivers allocate the resources needed to establish device/kernel communication, such as:

中断请求线
IRQ line

正如您将在“设备和内核之间的交互”部分中看到的那样,需要为 NIC 分配一个 IRQ,并在需要时使用它来引起内核的注意。然而,虚拟设备不需要分配 IRQ:环回设备就是一个例子,因为它的活动完全是内部的(请参阅后面的“虚拟设备”部分)。

用于请求和释放 IRQ 线的两个函数将在后面的“硬件中断”部分中介绍。正如您将在后面的“通过 /proc 文件系统进行调整”部分中看到的, /proc/interrupts文件可用于查看当前分配的状态。

As you will see in the section "Interaction Between Devices and Kernel," NICs need to be assigned an IRQ and to use it to call for the kernel's attention when needed. Virtual devices, however, do not need to be assigned an IRQ: the loopback device is an example because its activity is totally internal (see the later section "Virtual Devices").

The two functions used to request and release IRQ lines are introduced in the later section "Hardware Interrupts." As you will see in the later section "Tuning via /proc Filesystem," the /proc/interrupts file can be used to view the status of the current assignments.

I/O端口和内存注册
I/O ports and memory registration

驱动程序通常将其设备内存区域（例如其配置寄存器）映射到系统内存中，以便驱动程序直接在系统内存地址上进行读/写操作；这可以简化代码。I/O 端口和内存分别用 request_region 和 release_region 来注册和释放。

It is common for a driver to map an area of its device's memory (its configuration registers, for example) into the system memory so that read/write operations by the driver will be made on system memory addresses directly; this can simplify the code. I/O ports and memory are registered and released with request_region and release_region, respectively.

设备与内核的交互

Interaction Between Devices and Kernel

几乎所有设备(包括 NIC)都通过以下两种方式之一与内核交互:

Nearly all devices (including NICs) interact with the kernel in one of two ways:

轮询
Polling

在内核端驱动。内核定期检查设备状态,看看它是否有什么要说的。

Driven on the kernel side. The kernel checks the device status at regular intervals to see if it has anything to say.

中断
Interrupt

在设备端驱动。当设备需要内核关注时,它会向内核发送硬件信号(通过生成中断)。

Driven on the device side. The device sends a hardware signal (by generating an interrupt) to the kernel when it needs the kernel's attention.

第 9 章中,您可以找到有关 NIC 驱动程序设计替代方案以及软件中断的详细讨论。您还将了解 Linux 如何结合使用轮询和中断来提高性能。在本章中,我们将只关注基于中断的情况。

In Chapter 9, you can find a detailed discussion of NIC driver design alternatives as well as software interrupts. You will also see how Linux can use a combination of polling and interrupts to increase performance. In this chapter, we will look only at the interrupt-based case.

我不会详细介绍硬件如何报告中断、硬件异常和设备中断之间的区别、驱动程序和总线内核基础设施是如何设计的等等；您可以参考《Linux Device Drivers》和《Understanding the Linux Kernel》来了解这些主题。但我将简要概述中断，以帮助您了解设备驱动程序如何初始化和注册它们负责的设备，特别关注网络方面。

I won't go into detail on how interrupts are reported by the hardware, the difference between hardware exceptions and device interrupts, how the driver and bus kernel infrastructures are designed, etc.; you can refer to Linux Device Drivers and Understanding the Linux Kernel for those topics. But I'll give a brief overview on interrupts to help you understand how device drivers initialize and register the devices they are responsible for, with special attention to the networking aspect.

硬件中断

Hardware Interrupts

您无需了解有关硬件中断方式的底层背景 被处理。但是,有一些细节值得一提,因为它们可以让您更轻松地理解 NIC 设备驱动程序的编写方式,以及它们如何与上层网络层交互。

You do not need to know the low-level background about how hardware interrupts are handled. However, there are details worth mentioning because they can make it easier to understand how NIC device drivers are written, and therefore how they interact with the upper networking layers.

每个中断都运行一个称为中断处理程序的函数，该函数必须针对设备进行定制，因此由设备驱动程序安装。通常，当设备驱动程序注册 NIC 时，它会请求并分配 IRQ。然后，它使用以下两个与体系结构相关的函数为给定 IRQ 注册处理程序，以及（在驱动程序卸载时）取消注册处理程序。它们在 kernel/irq/manage.c 中定义，并可被 arch/XXX/kernel/irq.c 中特定于体系结构的函数覆盖，其中 XXX 是特定于体系结构的目录：

Every interrupt runs a function called an interrupt handler, which must be tailored to the device and therefore is installed by the device driver. Typically, when a device driver registers an NIC, it requests and assigns an IRQ. It then registers and (if the driver is unloaded) unregisters a handler for a given IRQ with the following two architecture-dependent functions. They are defined in kernel/irq/manage.c and are overridden by architecture-specific functions in arch/ XXX /kernel/irq.c, where XXX is the architecture-specific directory:

int request_irq(unsigned int irq, void (*handler)(int, void*, struct pt_regs*), unsigned long irqflags, const char * devname, void *dev_id)
int request_irq(unsigned int irq, void (*handler)(int, void*, struct pt_regs*), unsigned long irqflags, const char * devname, void *dev_id)

该函数注册一个处理程序，首先确保请求的中断是有效的，并且它尚未分配给另一个设备，除非两个设备都支持共享 IRQ（请参阅后面的"中断共享"部分）。

This function registers a handler, first making sure that the requested interrupt is a valid one, and that it is not already allocated to another device unless both devices understand shared IRQs (see the later section "Interrupt sharing").

void free_irq(unsigned int irq, void *dev_id)
void free_irq(unsigned int irq, void *dev_id)

给定由 dev_id 标识的设备，如果没有更多设备为该 IRQ 注册，则此函数将删除处理程序并禁用 IRQ 线。请注意，为了识别处理程序，内核需要 IRQ 号和设备标识符。这对于共享 IRQ 尤为重要，如后面的"中断共享"部分所述。

Given the device identified by dev_id, this function removes the handler and disables the IRQ line if no more devices are registered for that IRQ. Note that to identify the handler, the kernel needs both the IRQ number and the device identifier. This is especially important with shared IRQs, as explained in the later section "Interrupt sharing."

当内核收到中断通知时，它使用 IRQ 号找到驱动程序的处理程序，然后执行该处理程序。为了查找处理程序，内核将 IRQ 号和函数处理程序之间的关联存储在一个全局表中。这种关联可以是一对一，也可以是一对多，因为 Linux 内核允许多个设备使用相同的 IRQ，这一特性将在后面的"中断共享"一节中介绍。

When the kernel receives an interrupt notification, it uses the IRQ number to find out the driver's handler and then executes this handler. To find handlers, the kernel stores the associations between IRQ numbers and function handlers in a global table. The association can be either one-to-one or one-to-many, because the Linux kernel allows multiple devices to use the same IRQ, a feature described in the later section "Interrupt sharing."

在以下部分中,您将看到设备和驱动程序之间通过中断方式交换信息的常见示例,以及在某些条件下多个设备如何共享 IRQ。

In the following sections, you will see common examples of the information exchanged between devices and drivers by means of interrupts, and how an IRQ can be shared by multiple devices under some conditions.

中断类型

Interrupt types

通过中断,NIC 可以告诉其驱动程序一些不同的事情。其中包括:

With an interrupt, an NIC can tell its driver several different things. Among them are:

接收帧
Reception of a frame

这是最常见、最标准的情况。

This is the most common and standard situation.

传输故障
Transmission failure

仅当称为指数二进制退避的功能失败后,才会在以太网设备上生成此类通知(此功能由 NIC 在硬件级别实现)。请注意,驱动程序不会将此通知转发到更高的网络层;他们将通过其他方式了解失败(计时器超时、否定 ACK 等)。

This kind of notification is generated on Ethernet devices only after a feature called exponential binary backoff has failed (this feature is implemented at the hardware level by the NIC). Note that the driver will not relay this notification to higher network layers; they will come to know about the failure by other means (timer timeouts, negative ACKs, etc.).

DMA 传输已成功完成
DMA transfer has completed successfully

给定要发送的帧,一旦帧上传到 NIC 内存以便在介质上传输,驱动程序就会释放保存该帧的缓冲区。通过同步传输(无 DMA),驱动程序可以立即知道帧何时上传到 NIC。但对于使用异步传输的 DMA,设备驱动程序需要等待来自 NIC 的显式中断。您可以在驱动程序代码drivers/net/3c59x.c (DMA) 和drivers/net/3c509.c (非 DMA)中调用dev_kfree_skb [ * ]的位置找到每种情况的示例 。

Given a frame to send, the buffer that holds it is released by the driver once the frame has been uploaded into the NIC's memory for transmission on the medium. With synchronous transmissions (no DMA), the driver knows right away when the frame has been uploaded on the NIC. But with DMA, which uses asynchronous transmissions, the device driver needs to wait for an explicit interrupt from the NIC. You can find an example of each case at points where dev_kfree_skb [*] is called within the driver code drivers/net/3c59x.c (DMA) and drivers/net/3c509.c (non-DMA).

设备有足够的内存来处理新的传输
Device has enough memory to handle a new transmission

当出口队列没有足够的可用空间来容纳最大大小的帧(例如,以太网 NIC 为 1,536 字节)时,NIC 设备驱动程序通常会通过停止出口队列来禁用传输。当内存可用时,队列将重新启用。本节的其余部分将更详细地讨论此案例。

It is common for an NIC device driver to disable transmissions by stopping the egress queue when that queue does not have sufficient free space to hold a frame of maximum size (e.g., 1,536 bytes for an Ethernet NIC). The queue is then re-enabled when memory becomes available. The rest of this section goes into this case in more detail.

前面列表中的最后一种情况涵盖了一种复杂的节流传输方式，如果操作得当，可以提高效率。在此系统中，设备驱动程序因缺乏排队空间而禁用传输，要求 NIC 在可用内存大于给定量（通常是设备的最大传输单元，即 MTU）时发出中断，然后在中断到来时重新启用传输。

The final case in the previous list covers a sophisticated way of throttling transmissions in a manner that can improve efficiency if done properly. In this system, a device driver disables transmissions for lack of queuing space, asks the NIC to issue an interrupt when the available memory is bigger than a given amount (typically the device's Maximum Transmission Unit, or MTU), and then re-enables transmissions when the interrupt comes.

设备驱动程序还可以在传输之前禁用出口队列（以防止内核在设备上生成另一个传输请求），并仅在 NIC 上有足够的可用内存时重新启用它；如果没有，设备会请求中断，以便稍后恢复传输。以下是此逻辑的示例，取自 el3_start_xmit 例程，drivers/net/3c509.c 驱动程序将其安装为其 net_device 结构中的 hard_start_xmit 函数：

A device driver can also disable the egress queue before a transmission (to prevent the kernel from generating another transmission request on the device), and re-enable it only if there is enough free memory on the NIC; if not, the device asks for an interrupt that allows it to resume transmission at a later time. Here is an example of this logic, taken from the el3_start_xmit routine, which the drivers/net/3c509.c driver installs as its hard_start_xmit [] function in its net_device structure:

static int
el3_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    ... ... ...
    netif_stop_queue(dev);
    ... ... ...
    if (inw(ioaddr + TX_FREE) > 1536)
        netif_start_queue(dev);
    else
        outw(SetTxThreshold + 1536, ioaddr + EL3_CMD);
    ... ... ...
}
static int
el3_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    ... ... ...
    netif_stop_queue (dev);
    ... ... ...
    if (inw(ioaddr + TX_FREE) > 1536)
        netif_start_queue(dev);
    else
        outw(SetTxThreshold + 1536, ioaddr + EL3_CMD);
    ... ... ...
}

驱动程序使用 netif_stop_queue 停止设备队列，从而阻止内核提交进一步的传输请求。然后，驱动程序检查设备内存是否有足够的可用空间来容纳 1,536 字节的数据包。如果有，驱动程序启动队列以允许内核再次提交传输请求；否则，它会指示设备（通过 outw 调用写入配置寄存器）在满足该条件时生成中断。然后，中断处理程序将使用 netif_start_queue 重新启用设备队列，以便内核可以重新启动传输。

The driver stops the device queue with netif_stop_queue, thus inhibiting the kernel from submitting further transmission requests. The driver then checks whether the device's memory has enough free space for a packet of 1,536 bytes. If so, the driver starts the queue to allow the kernel once again to submit transmission requests; otherwise, it instructs the device (by writing to a configuration register with an outw call) to generate an interrupt when that condition will be met. An interrupt handler will then re-enable the device queue with netif_start_queue so that the kernel can restart transmissions.

netif_xxx_queue 例程将在第 11 章的"启用和禁用传输"部分中描述。

The netif_ xxx _queue routines are described in the section "Enabling and Disabling Transmissions" in Chapter 11.

中断共享

Interrupt sharing

IRQ 线是有限的资源。增加系统可承载的设备数量的一个简单方法是允许多个设备共享一条公共 IRQ 线。通常，每个驱动程序都会为该 IRQ 向内核注册自己的处理程序。内核不是在收到中断通知后找到正确的设备并只调用它的处理程序，而是简单地调用为同一共享 IRQ 注册的那些设备的所有处理程序。由处理程序来过滤虚假调用，例如通过读取其设备上的寄存器。

IRQ lines are a limited resource. A simple way to increase the number of devices a system can host is to allow multiple devices to share a common IRQ. Normally, each driver registers its own handler to the kernel for that IRQ. Instead of having the kernel receive the interrupt notification, find the right device, and invoke its handler, the kernel simply invokes all the handlers of those devices that registered for the same shared IRQ. It is up to the handlers to filter spurious invocations, such as by reading a register on their devices.

对于共享一条 IRQ 线的一组设备，所有这些设备都必须具有能够处理共享 IRQ 的设备驱动程序。换句话说，每次设备注册一条 IRQ 线时，都需要明确说明它是否支持中断共享。例如，第一个注册某个 IRQ 的设备（说类似"把 IRQ n 分配给我，并使用例程 fn 作为处理程序"之类的话）还必须指定它是否愿意与其他设备共享该 IRQ。当另一个设备驱动程序尝试注册相同的 IRQ 号时，如果它或当前分配了该 IRQ 的驱动程序无法共享 IRQ，它就会被拒绝。

For a group of devices to share an IRQ line, all of them must have device drivers capable of handling shared IRQs. In other words, each time a device registers for an IRQ line, it needs to explicitly say whether it supports interrupt sharing. For example, the first device that registers for one IRQ, saying something like "assign me IRQ n and use this routine fn as the handler," must also specify whether it is willing to share the IRQ with other devices. When another device driver tries to register the same IRQ number, it is refused if either it, or the driver to which the IRQ is currently assigned, is incapable of sharing IRQs.

IRQ 到处理程序映射的组织

Organization of IRQs to handler mappings

IRQ 到处理程序的映射存储在一个列表向量中，每个 IRQ 都有一个处理程序列表（见图 5-2）。仅当多个设备共享相同的 IRQ 时，列表才包含多个元素。向量的大小（即可能的 IRQ 编号的数量）取决于体系结构，可以从 15（在 x86 上）到超过 200 不等。随着中断共享的引入，系统上一次可以支持更多的设备。

The mapping of IRQs to handlers is stored in a vector of lists, one list of handlers for each IRQ (see Figure 5-2). A list includes more than one element only when multiple devices share the same IRQ. The size of the vector (i.e., the number of possible IRQ numbers) is architecture dependent and can vary from 15 (on an x86) to more than 200. With the introduction of interrupt sharing, even more devices can be supported on a system at once.

“硬件中断”一节已经介绍了内核提供的两个函数,分别用于注册和取消注册处理程序。现在让我们看看用于存储映射的数据结构。

The section "Hardware Interrupts" already introduced the two functions provided by the kernel to register and unregister a handler, respectively. Let's now see the data structure used to store the mappings.

映射是用 irqaction 数据结构定义的。前面"硬件中断"部分中介绍的 request_irq 函数是 setup_irq 的包装器，后者将 irqaction 结构作为输入并将其插入到全局 irq_desc 向量中。irq_desc 在 kernel/irq/handler.c 中定义，并且可以在每个体系结构的文件 arch/XXX/kernel/irq.c 中覆盖。setup_irq 在 kernel/irq/manage.c 中定义，并且同样可以在每个体系结构的文件 arch/XXX/kernel/irq.c 中覆盖。

Mappings are defined with irqaction data structures. The request_irq function introduced in the earlier section "Hardware Interrupts" is a wrapper around setup_irq, which takes an irqaction structure as input and inserts it into the global irq_desc vector. irq_desc is defined in kernel/irq/handler.c and can be overridden in the per-architecture files arch/ XXX /kernel/irq.c. setup_irq is defined in kernel/irq/manage.c and can be overridden in the per-architecture files arch/ XXX /kernel/irq.c.

处理中断并将其传递给驱动程序的内核函数取决于体系结构。在大多数体系结构上，它叫做 handle_IRQ_event。

The kernel function that handles interrupts and passes them to drivers is architecture dependent. It is called handle_IRQ_event on most architectures.

图 5-2 显示了 irqaction 实例的存储方式：每个可能的 IRQ 都有一个 irq_desc 实例，每个成功注册的 IRQ 处理程序都有一个 irqaction 实例。irq_desc 实例的向量本身也叫 irq_desc，其大小由依赖于体系结构的符号 NR_IRQS 给出。

Figure 5-2 shows how irqaction instances are stored: there is an instance of irq_desc for each possible IRQ and an instance of irqaction for each successfully registered IRQ handler. The vector of irq_desc instances is called irq_desc as well, and its size is given by the architecture-dependent symbol NR_IRQS.

请注意，当给定 IRQ 编号（即 irq_desc 向量的给定元素）有多个 irqaction 实例时，需要中断共享（每个结构都必须设置 SA_SHIRQ 标志）。

Note that when you have more than one irqaction instance for a given IRQ number (that is, for a given element of the irq_desc vector), interrupt sharing is required (each structure must have the SA_SHIRQ flag set).

IRQ 处理程序的组织

图 5-2。IRQ 处理程序的组织

Figure 5-2. Organization of IRQ handlers

现在让我们看看 irqaction 数据结构的字段中存储了有关 IRQ 处理程序的哪些信息：

Let's see now what information is stored about IRQ handlers in the fields of an irqaction data structure:

void (*handler)(int irq, void *dev_id, struct pt_regs *regs)
void (*handler)(int irq, void *dev_id, struct pt_regs *regs)

设备驱动程序提供的处理中断通知的函数：每当内核在 irq 线上接收到中断时，它就会调用 handler。以下是该函数的输入参数：

int irq

生成通知的 IRQ 号。大多数时候,NIC 的设备驱动程序不使用它来完成其工作;设备 ID 就足够了。

void *dev_id

设备标识符。同一驱动程序可以同时负责不同的设备,因此需要设备ID才能正确处理通知。

struct pt_regs *regs

用于保存中断中断当前进程时处理器寄存器内容的结构。中断处理程序通常不使用它。

Function provided by the device driver to handle notifications of interrupts: whenever the kernel receives an interrupt on line irq, it invokes handler. Here are the function's input parameters:

int irq

IRQ number that generated the notification. Most of the time it is not used by the NICs' device drivers to accomplish their job; the device ID is sufficient.

void *dev_id

Device identifier. The same driver can be responsible for different devices at the same time, so it needs the device ID to process the notification correctly.

struct pt_regs *regs

Structure used to save the content of the processor's registers at the moment the interrupt interrupted the current process. It is normally not used by the interrupt handler.

unsigned long flags
unsigned long flags

一组标志。可能的值 SA_XXX 在 include/asm-XXX/signal.h 中定义。以下是 x86 体系结构文件中的主要标志：

SA_SHIRQ

设置后,设备驱动程序可以处理共享 IRQ。

SA_SAMPLE_RANDOM

设置后,设备将自身作为随机事件源。这对于帮助内核生成供内部使用的随机数很有用,称为对系统熵的贡献。这将在后面的“初始化设备处理层:net_dev_init ”部分中进一步描述。

SA_INTERRUPT

设置后，处理程序将在本地处理器上禁用中断的情况下运行。只应为可以很快完成的处理程序指定此标志。请参阅 handle_IRQ_event 的实例之一以获取示例（例如 kernel/irq/handle.c）。

还有其他值,但它们要么已过时,要么仅由特定架构使用。

Set of flags. The possible values SA_ XXX are defined in include/asm-XXX/signal.h. Here are the main ones from the x86 architecture file:

SA_SHIRQ

When set, the device driver can handle shared IRQs.

SA_SAMPLE_RANDOM

When set, the device is making itself available as a source of random events. This can be useful to help the kernel generate random numbers for internal use, and is called contributing to system entropy. This is further described in the later section "Initializing the Device Handling Layer: net_dev_init."

SA_INTERRUPT

When set, the handler runs with interrupts disabled on the local processor. This should be specified only for handlers that can get done very quickly. See one of the handle_IRQ_event instances for an example (for instance, kernel/irq/handle.c).

There are other values, but they are either obsolete or used only by particular architectures.

void *dev_id
void *dev_id

指向与设备关联的 net_device 数据结构的指针。将它声明为 void * 的原因是 NIC 不是唯一使用 IRQ 的设备。由于各种设备类型使用不同的数据结构来标识和表示设备实例，因此这里使用了通用类型声明。

Pointer to the net_device data structure associated with the device. The reason it is declared void * is that NICs are not the only devices to use IRQs. Because various device types use different data structures to identify and represent device instances, a generic type declaration is used.

struct irqaction *next
struct irqaction *next

共享相同 IRQ 编号的所有设备都通过该指针链接在一个列表中。

All the devices sharing the same IRQ number are linked together in a list with this pointer.

const char *name
const char *name

设备名称。您可以通过转储/proc/interrupts的内容来读取它。

Device name. You can read it by dumping the contents of /proc/interrupts.

初始化选项

Initialization Options

内置于内核中的组件和作为模块加载的组件都可以传递输入参数,以便用户可以微调组件实现的功能,覆盖编译到其中的默认值,或者将它们从一个系统启动更改为下一个系统启动。内核提供了两种宏来定义选项:

Both components built into the kernel and components loaded as modules can be passed input parameters so that users can fine-tune the functionality implemented by the components, override defaults compiled into them, or change them from one system boot to the next. The kernel provides two kinds of macros to define options:

模块选项（module_param 系列宏）
Module options (macros of the module_param family)

这些定义了您在加载模块时可以提供的选项。当组件内置到内核中时,您无法在内核启动时为这些选项提供值。但是,随着/sys文件系统的引入,您可以在运行时通过这些文件配置选项。与/proc接口相比, /sys接口相对较新。后面的“模块选项”部分将更详细地介绍这些选项。

These define options you can provide when you load a module. When a component is built into the kernel, you cannot provide values for these options at kernel boot time. However, with the introduction of the /sys filesystem, you can configure the options via those files at runtime. The /sys interface is relatively new, compared to the /proc interface. The later section "Module Options" goes into a little more detail on these options.

启动时内核选项（__setup 系列宏）
Boot-time kernel options (macros of the _ _setup family)

这些定义了您可以在引导时通过引导加载程序提供的选项。它们主要由用户可以构建到内核中的模块以及不能编译为模块的内核组件使用。您将在第 7 章的“引导时内核选项”部分中看到这些宏。

These define options you can provide at boot time with a boot loader. They are used mainly by modules that the user can build into the kernel, and kernel components that cannot be compiled as modules. You will see those macros in the section "Boot-Time Kernel Options" in Chapter 7.

有趣的是，模块可以同时以两种方式定义初始化选项：一种在模块被内置时有效，另一种在模块单独加载时有效。这可能有点令人困惑，特别是因为不同的模块可以在模块加载时定义同名参数而没有任何名称冲突的风险（即参数只传递给正在加载的模块），但如果您在内核启动时传递这些参数，则必须确保各个模块的选项之间没有名称冲突。

It is interesting to note that a module can define an initialization option in both ways: one is effective when the module is built-in and the other is effective when the module is loaded separately. This can be a little confusing, especially because different modules can define passing parameters of the same name at module load time without any risk of name collision (i.e., the parameters are passed just to the module being loaded), but if you pass those parameters at kernel boot time, you must make sure there is no name collision between the various modules' options.

我们不会详细讨论这两种方法的优缺点。您可以查看 drivers/block/loop.c 驱动程序，以获取同时使用 module_param 和 __setup 的清晰示例。

We will not go into detail on the pros and cons of the two approaches. You can look at the drivers/block/loop.c driver for a clear example using both module_param and _ _setup.

模块选项

Module Options

内核模块通过诸如 module_param 之类的宏定义其参数；有关列表，请参阅 include/linux/moduleparam.h。module_param 需要三个输入参数，如以下来自 drivers/net/sis900.c 的示例所示：

Kernel modules define their parameters by means of macros such as module_param; see include/linux/moduleparam.h for a list. module_param requires three input parameters, as shown in the following example from drivers/net/sis900.c:

...
module_param(multicast_filter_limit, int, 0444);
module_param(max_interrupt_work, int, 0444);
module_param(debug, int, 0444);
...
...
module_param(multicast_filter_limit, int, 0444);
module_param(max_interrupt_work, int, 0444);
module_param(debug, int, 0444);
...

第一个输入参数是要提供给用户的参数的名称。第二个是参数的类型(例如整数),第三个表示分配给/sys中参数将导出到的文件的权限。

The first input parameter is the name of the parameter to be offered to the user. The second is the type of the parameter (e.g., integer), and the third represents the permissions assigned to the file in /sys to which the parameter will be exported.

这是在/sys中列出模块目录时您会得到的结果:

This is what you would get when listing the module's directory in /sys:

[root@localhost src]# ls -la /sys/module/sis900/parameters/
total 0
drwxr-xr-x  2 root root    0 Apr  9 18:31 .
drwxr-xr-x  4 root root    0 Apr  9 18:31 ..
-r--r--r--  1 root root    0 Apr  9 18:31 debug
-r--r--r--  1 root root 4096 Apr  9 18:31 max_interrupt_work
-r--r--r--  1 root root 4096 Apr  9 18:31 multicast_filter_limit
[root@localhost src]#
[root@localhost src]# ls -la /sys/module/sis900/parameters/
total 0
drwxr-xr-x  2 root root    0 Apr  9 18:31 .
drwxr-xr-x  4 root root    0 Apr  9 18:31 ..
-r--r--r--  1 root root    0 Apr  9 18:31 debug
-r--r--r--  1 root root 4096 Apr  9 18:31 max_interrupt_work
-r--r--r--  1 root root 4096 Apr  9 18:31 multicast_filter_limit
[root@localhost src]#

每个模块都在 /sys/modules 中分配一个目录。子目录 /sys/modules/module/parameters 保存由 module 导出的每个参数的文件。前面来自 drivers/net/sis900.c 的快照显示了三个任何人都可读但不可写（无法更改）的选项。

Each module is assigned a directory in /sys/modules. The subdirectory /sys/modules/ module / parameters holds a file for each parameter exported by module. The previous snapshot from drivers/net/sis900.c shows three options that are readable by anyone, but not writable (they cannot be changed).

/sys文件(顺便说一句,还有/proc文件)的权限是使用与常见文件相同的语法定义的,因此您可以为所有者、组和其他人指定读取、写入和执行权限。例如,值 400 表示所有者(root 用户)具有读取访问权限,而任何人都没有其他访问权限。当分配值 0 时,没有人具有任何权限,您甚至不会在/sys中看到该文件。

Permissions on /sys files (and on /proc files, incidentally) are defined using the same syntax as common files, so you can specify read, write, and execute permissions for the owner, the group, and everybody else. A value of 400 means, for example, read access for the owner (who is the root user) and no other access for anyone. When a value of 0 is assigned, no one has any permissions and you would not even see the file in /sys.

如果组件程序员希望用户能够读取参数的值,她必须至少授予读取权限。她还可以提供写权限以允许用户修改值。但是,请考虑到导出参数的模块不会收到有关文件发生任何更改的通知,因此该模块必须具有检测更改或能够应对更改的机制。

If the component programmer wants the user to be able to read the values of parameters, she must give at least read permission. She can also provide write permission to allow users to modify values. However, take into account that the module that exports the parameter is not notified about any change to the file, so the module must have a mechanism to detect the change or be able to cope with changes.

有关/sys接口的详细说明,请参阅Linux 设备驱动程序

For a detailed description of the /sys interface, refer to Linux Device Drivers.

初始化设备处理层:net_dev_init

Initializing the Device Handling Layer: net_dev_init

网络代码初始化的一个重要部分（包括流量控制和每 CPU 入口队列）是在启动时由 net_dev_init 执行的，该函数定义在 net/core/dev.c 中：

An important part of initialization for the networking code, including Traffic Control and per-CPU ingress queues, is performed at boot time by net_dev_init, defined in net/core/dev.c:

static int __init net_dev_init(void)
{
    ...
}
subsys_initcall(net_dev_init);
static int __init net_dev_init(void)
{
    ...
}
subsys_initcall(net_dev_init);

请参阅第 7 章，了解 subsys_initcall 宏如何确保 net_dev_init 在任何 NIC 设备驱动程序注册之前运行，以及为什么这很重要。您还将看到为什么 net_dev_init 用 __init 宏标记。

See Chapter 7 for how the subsys_initcall macros ensure that net_dev_init runs before any NIC device drivers register themselves, and why this is important. You also will see why net_dev_init is tagged with the __init macro.

让我们来看看主要部分net_dev_init

Let's walk through the main parts of net_dev_init:

  • 两个网络软件中断(软中断)使用的每 CPU 数据结构均已初始化。在第 9 章中,我们将了解什么是软中断,并详细介绍网络代码如何使用软中断。

  • The per-CPU data structures used by the two networking software interrupts (softirqs) are initialized. In Chapter 9, we will see what a softirq is and go into detail on how the networking code uses softirqs.

  • 当内核编译为支持 /proc 文件系统（这是默认配置）时，dev_proc_init 和 dev_mcast_init 会向 /proc 中添加一些文件。有关更多详细信息，请参阅后面的"通过 /proc 文件系统调整"部分。

  • When the kernel is compiled with support for the /proc filesystem (which is the default configuration), a few files are added to /proc with dev_proc_init and dev_mcast_init. See the later section "Tuning via /proc Filesystem" for more details.

  • netdev_sysfs_init 向 sysfs 注册 net 类。这将创建目录 /sys/class/net，在该目录下您将找到每个已注册网络设备的子目录。这些目录包含大量文件，其中一些文件曾经位于 /proc 中。

  • netdev_sysfs_init registers the net class with sysfs. This creates the directory /sys/class/net, under which you will find a subdirectory for each registered network device. These directories include lots of files, some of which used to be in /proc.

  • net_random_init 初始化每个 CPU 的种子向量，在使用 net_random 例程生成随机数时会用到它。net_random 用于不同的上下文，本节稍后将对此进行描述。

  • net_random_init initializes a per-CPU vector of seeds that will be used when generating random numbers with the net_random routine. net_random is used in different contexts, described later in this section.

  • 第 33 章中描述的与协议无关的目的缓存 (DST) 是用 dst_init 初始化的。

  • The protocol-independent destination cache (DST), described in Chapter 33, is initialized with dst_init.

  • 用于解复用入口流量的协议处理程序向量 ptype_base 被初始化。详细信息请参见第 13 章。

  • The protocol handler vector ptype_base, used to demultiplex ingress traffic, is initialized. See Chapter 13 for more details.

  • 当定义了 OFFLINE_SAMPLE 符号时，内核会设置一个定期运行的函数来收集有关设备队列长度的统计信息。在这种情况下，net_dev_init 需要创建定期运行该函数的计时器。请参阅第 10 章中的"平均队列长度和拥塞级别计算"部分。

  • When the OFFLINE_SAMPLE symbol is defined, the kernel sets up a function to run at regular intervals to collect statistics about the devices' queue lengths. In this case, net_dev_init needs to create the timer that runs the function regularly. See the section "Average Queue Length and Congestion-Level Computation" in Chapter 10.

  • 回调处理程序注册到通知链中,发出有关 CPU 热插拔事件的通知。使用的回调是dev_cpu_callback. 目前,处理的唯一事件是 CPU 的停止。当收到此通知时,CPU 入口队列中的缓冲区将出队并传递到netif_rx。有关每个 CPU 入口队列的更多详细信息,请参阅第 9 章。

  • A callback handler is registered with the notification chain that issues notifications about CPU hotplug events. The callback used is dev_cpu_callback. Currently, the only event processed is the halting of a CPU. When this notification is received, the buffers in the CPU's ingress queue are dequeued and are passed to netif_rx. See Chapter 9 for more detail on per-CPU ingress queues.

随机数生成是内核执行的一项支持功能,旨在帮助随机化其自身的一些活动。您将在本书中看到许多网络子系统使用随机生成的值。例如,他们经常在定时器的延迟中添加随机成分,从而降低定时器同时运行和后台处理负载的 CPU 的可能性。随机化还可以防御试图猜测某些数据结构的组织的拒绝服务 (DoS) 攻击。

Random number generation is a support function that the kernel performs to help randomize some of its own activity. You will see in this book that many networking subsystems use randomly generated values. For instance, they often add a random component to the delay of timers, making it less likely for timers to run simultaneously and load down the CPU with background processing. Randomization can also defend against a Denial of Service (DoS) attack by someone who tries to guess the organization of certain data structures.

内核产生的数字可以被认为是真正随机的程度称为系统熵。它通过活动具有不确定性的内核组件的贡献得到改进，而网络通常属于这一类。目前，只有少数 NIC 设备驱动程序会对系统熵做出贡献（请参阅前面关于 SA_SAMPLE_RANDOM 的讨论）。内核 2.4 的一个补丁添加了一个编译时选项，您可以用它来启用或禁用 NIC 对系统熵的贡献。使用关键字"SA_SAMPLE_NET_RANDOM"在网上搜索，您将找到当前版本。

The degree to which the kernel's numbers can be considered truly random is called system entropy . It is improved through contributions by kernel components whose activity has a nondeterministic aspect, and networking often falls in this category. Currently, only a few NIC device drivers contribute to system entropy (see earlier discussion on SA_SAMPLE_RANDOM). A patch for kernel 2.4 adds a compile time option that you can use to enable or disable the contribution to system entropy by NICs. Search the Web using the keyword "SA_SAMPLE_NET_RANDOM," and you will find the current version.

遗留代码

Legacy Code

我在上一节中提到,subsys_initcall宏确保net_dev_init在任何设备驱动程序有机会注册其设备之前执行。在引入这一机制之前,执行顺序通常以不同的方式强制执行,使用一次性标志的老式机制。

I mentioned in the previous section that the subsys_initcall macros ensure that net_dev_init is executed before any device driver has a chance to register its devices. Before the introduction of this mechanism, the order of execution used to be enforced differently, using the old-fashioned mechanism of a one-time flag.

全局变量 dev_boot_phase 用作布尔标志来记住 net_dev_init 是否还需要执行。它被初始化为 1（即 net_dev_init 尚未执行），并由 net_dev_init 清除。每次设备驱动程序调用 register_netdevice 时，它都会检查 dev_boot_phase 的值，如果该标志被置位（表明 net_dev_init 尚未执行），就执行 net_dev_init。

The global variable dev_boot_phase was used as a Boolean flag to remember whether net_dev_init had to be executed. It was initialized to 1 (i.e., net_dev_init had not been executed yet) and was cleared by net_dev_init. Each time register_netdevice was invoked by a device driver, it checked the value of dev_boot_phase and executed net_dev_init if the flag was set, indicating the function had not yet been executed.

不再需要此机制，因为如果对关键设备驱动程序的例程应用了正确的标记（如第 7 章所述），register_netdevice 就不可能在 net_dev_init 之前被调用。但是，为了检测错误的标记或有缺陷的代码，net_dev_init 仍然会清除 dev_boot_phase 的值，而 register_netdevice 使用 BUG_ON 宏来确保它在 dev_boot_phase 置位时绝不会被调用。[*]

This mechanism is not needed anymore, because register_netdevice cannot be called before net_dev_init if the correct tagging is applied to key device drivers' routines, as described in Chapter 7. However, to detect wrong tagging or buggy code, net_dev_init still clears the value of dev_boot_phase, and register_netdevice uses the macro BUG_ON to make sure it is never called when dev_boot_phase is set.[*]

用户空间助手

User-Space Helpers

在某些情况下,内核调用用户空间应用程序来处理事件是有意义的。其中两个助手尤其重要:

There are cases where it makes sense for the kernel to invoke a user-space application to handle events. Two such helpers are particularly important:

/sbin/modprobe
/sbin/modprobe

当内核需要加载模块时调用。该帮助程序是module-init-tools包的一部分 。

Invoked when the kernel needs to load a module. This helper is part of the module-init-tools package.

/sbin/hotplug
/sbin/hotplug

当内核检测到新设备插入系统或从系统中拔出时调用。它的主要工作是根据设备标识符加载正确的设备驱动程序（模块）。设备由它们所插入的总线（例如 PCI）以及总线规范定义的关联 ID 来标识。[] 该帮助程序是 hotplug 包的一部分。

Invoked when the kernel detects that a new device has been plugged into or unplugged from the system. Its main job is to load the correct device driver (module) based on the device identifier. Devices are identified by the bus they are plugged into (e.g., PCI) and the associated ID defined by the bus specification.[] This helper is part of the hotplug package.

The kernel provides a function named call_usermodehelper to execute such user-space helpers. The function allows the caller to pass the application a variable number of both arguments in arg[] and environment variables in env[]. For example, the first argument arg[0] tells call_usermodehelper what user-space helper to launch, and arg[1] can be used to tell the helper itself what configuration script to use (often called the user-space agent). We will see an example in the later section "/sbin/hotplug."

Figure 5-3 shows how two kernel routines, request_module and kobject_hotplug, invoke call_usermodehelper to launch /sbin/modprobe and /sbin/hotplug, respectively. It also shows examples of how arg[] and envp[] are initialized in the two cases. The following subsections go into a little more detail on each of these two user-space helpers.

Figure 5-3. Event propagation from kernel to user space

kmod

kmod is the kernel module loader that allows kernel components to request the loading of a module. The kernel provides more than one routine, but here we'll look only at request_module. This function initializes arg[1] with the name of the module to load. /sbin/modprobe uses the configuration file /etc/modprobe.conf to do various things, one of which is to see whether the module name received from the kernel is actually an alias to something else (see Figure 5-3).

Here are two examples of events that would lead the kernel to ask /sbin/modprobe to load a module:

  • When the administrator uses ifconfig to configure a network card whose device driver has not been loaded yet—say, for device eth0[*]—the kernel sends a request to /sbin/modprobe to load the module whose name is the string "eth0". If /etc/modprobe.conf contains the entry "alias eth0 3c59x", /sbin/modprobe tries loading the module 3c59x.ko.

  • When the administrator configures Traffic Control on a device with the IPROUTE2 package's tc command, it may refer to a queuing discipline or a classifier that is not in the kernel. In this case, the kernel sends /sbin/modprobe a request to load the relevant module.

For more details on modules and kmod, refer to Linux Device Drivers.

Hotplug

Hotplug was introduced into the Linux kernel to implement the popular consumer feature known as Plug and Play (PnP) . This feature allows the kernel to detect the insertion or removal of hot-pluggable devices and to notify a user-space application, giving the latter enough details to make it able to load the associated driver if needed, and to apply the associated configuration if one is present.

Hotplug can actually be used to take care of non-hot-pluggable devices as well, at boot time. The idea is that it does not matter whether a device was hot-plugged on a running system or if it was already plugged in at boot time; the user-space helper is notified in both cases. The user-space application decides whether the event requires any action on its part.

Linux systems, like most Unix systems, execute a set of scripts at boot time to initialize peripherals, including network devices. The syntax, names, and locations of these scripts change with different Linux distributions. (For example, distributions using the System V init model have a directory per run level in /etc/rc.d/, each one with its own configuration file indicating what to start. Other distributions are either based on the BSD model, or follow the BSD model in compatibility mode with System V.) Therefore, notifications for devices already present at boot time may be ignored because the scripts will eventually configure the associated devices.

When you compile the kernel modules, the object files are placed by default in the directory /lib/modules/ kernel_version /, where kernel_version is, for instance, 2.6.12. In the same directory you can find two interesting files: modules.pcimap and modules.usbmap. These files contain, respectively, the PCI IDs[*] and USB IDs of the devices supported by the kernel. The same files include, for each device ID, a reference to the associated kernel module. When the user-space helper receives a notification about a hot-pluggable device being plugged, it uses these files to find out the correct device driver.

The modules.xxxmap files are populated from ID vectors provided by device drivers. For example, you will see in the section "Example of PCI NIC Driver Registration" in Chapter 6 how the Vortex driver initializes its instance of pci_device_id. Because that driver is written for a PCI device, the contents of that table go into the modules.pcimap file.

If you are interested in the latest code, you can find more information at http://linux-hotplug.sourceforge.net.

/sbin/hotplug

The default user-space helper for Hotplug is the script[] /sbin/hotplug, part of the Hotplug package. This package can be configured with the files located in the default directories /etc/hotplug/ and /etc/hotplug.d/.

The kobject_hotplug function is invoked by the kernel to respond to the insertion and removal of a device, among other events. kobject_hotplug initializes arg[0] to /sbin/hotplug and arg[1] to the agent to be used: /sbin/hotplug is a simple script that delegates the processing of the event to another script (the agent) based on arg[1].

The user-space helper agents can be more or less complex based on how fancy you want the auto-configuration to be. The scripts provided with the Hotplug package try to recognize the Linux distribution and adapt the actions to their configuration file's syntax and location.

Let's take networking, the subject of this book, as an example of hotplugging. When an NIC is added to or removed from the system, kobject_hotplug initializes arg[1] to net, leading /sbin/hotplug to execute the net.agent agent.

Unlike the other agents shown in Figure 5-3, net.agent does not represent a medium or bus type. While the net agent is used to configure a device, other agents are used to load the correct modules (device drivers) based on the device identifiers.

net.agent is supposed to apply any configuration associated with the new device, so it needs the kernel to provide at least the device identifier. In the example shown in Figure 5-3, the device identifier is passed by the kernel through the INTERFACE environment variable.

To be able to configure a device, it must first be created and registered with the kernel. This task is normally driven by the associated device driver, which must therefore be loaded first. For instance, adding a PCMCIA Ethernet card causes several calls to /sbin/hotplug; among them:

  • One leading to the execution of /sbin/modprobe,[*] which will take care of loading the right module device driver. In the case of PCMCIA, the driver is loaded by the pci.agent agent (using the action ADD).

  • One configuring the new device. This is done by the net.agent agent (again using the action ADD).

Virtual Devices

A virtual device is an abstraction built on top of one or more real devices. The association between virtual devices and real devices can be many-to-many, as shown by the three models in Figure 5-4. It is also possible to build virtual devices on top of other virtual devices. However, not all combinations are meaningful or are supported by the kernel.

Figure 5-4. Possible relationship between virtual and real devices

Examples of Virtual Devices

Linux allows you to define different kinds of virtual devices. Here are a few examples:

Bonding

With this feature, a virtual device bundles a group of physical devices and makes them behave as one.

802.1Q

This is an IEEE standard that extends the 802.3/Ethernet header with the so-called VLAN header, allowing for the creation of Virtual LANs.

Bridging

A bridge interface is a virtual representation of a bridge. Details are in Part IV.

Aliasing interfaces

Originally, the main purpose for this feature was to allow a single real Ethernet interface to span several virtual interfaces (eth0:0, eth0:1, etc.), each with its own IP configuration. Now, thanks to improvements to the networking code, there is no need to define a new virtual interface to configure multiple IP addresses on the same NIC. However, there may be cases (notably routing) where having different virtual NICs on the same NIC would make life easier, perhaps allowing simpler configuration. Details are in Chapter 30.

True equalizer (TEQL)

This is a queuing discipline that can be used with Traffic Control. Its implementation requires the creation of a special device. The idea behind TEQL is a bit similar to Bonding.

Tunnel interfaces

The implementation of IP-over-IP tunneling (IPIP) and the Generalized Routing Encapsulation (GRE) protocol is based on the creation of a virtual device.

This list is not complete. Also, given the speed with which new features are included into the Linux kernel, you can expect to see new virtual devices being added to the kernel.

Bonding, bridging, and 802.1Q devices are examples of the model in Figure 5-4(c). Aliasing interfaces are examples of the model in Figure 5-4(b). The model in Figure 5-4(a) can be seen as a special case of the other two.

Interaction with the Kernel Network Stack

Virtual devices and real devices interact with the kernel in slightly different ways. For example, they differ with regard to the following points:

Initialization

Most virtual devices are assigned a net_device data structure, as real devices are. Often, most of the virtual device's net_device's function pointers are initialized to routines implemented as wrappers, more or less complex, around the function pointers used by the associated real devices.

However, not all virtual devices are assigned a net_device instance. Aliasing devices are an example; they are implemented as simple labels on the associated real device (see the section "Old-generation configuration: aliasing interfaces" in Chapter 30).

Configuration

It is common to provide ad hoc user-space tools to configure virtual devices, especially for the high-level fields that apply only to those devices and which could not be configured using standard tools such as ifconfig.

External interface

Each virtual device usually exports a file, or a directory with a few files, to the /proc filesystem. How complex and detailed the information exported with those files is depends on the kind of virtual device and on the design. You will see the ones used by each virtual device listed in the section "Virtual Devices" in their associated chapters (for those devices covered in this book). Files associated with virtual devices are extra files; they do not replace the ones associated with the physical devices. Aliasing devices, which do not have their own net_device instances, are again an exception.

Transmission

When the relationship of virtual device to real device is not one-to-one, the routine used to transmit may need to include, among other tasks, the selection of the real device to use.[*] Because QoS is enforced on a per-device basis, the multiple relationships between virtual devices and associated real devices have implications for the Traffic Control configuration.

Reception

Because virtual devices are software objects, they do not need to engage in interactions with real resources on the system, such as registering an IRQ handler or allocating I/O ports and I/O memory. Their traffic comes secondhand from the physical devices that perform those tasks. Packet reception happens differently for different types of virtual devices. For instance, 802.1Q interfaces register an Ethertype and are passed only those packets received by the associated real devices that carry the right protocol ID.[] In contrast, bridge interfaces receive any packet that arrives from the associated devices (see Chapter 16).

External notifications

Notifications from other kernel components about specific events taking place in the kernel[] are of interest as much to virtual devices as to real ones. Because virtual devices' logic is implemented on top of real devices, the latter have no knowledge about that logic and therefore are not able to pass on those notifications. For this reason, notifications need to go directly to the virtual devices. Let's use Bonding as an example: if one device in the bundle goes down, the algorithms used to distribute traffic among the bundle's members have to be made aware of that so that they do not select the devices that are no longer available.

Unlike these software-triggered notifications, hardware-triggered notifications (e.g., PCI power management) cannot reach virtual devices directly because there is no hardware associated with virtual devices.

Tuning via /proc Filesystem

Figure 5-5 shows the files that can be used either to tune or to view the status of configuration parameters related to the topics covered in this chapter.

In /proc/sys/kernel are the files modprobe and hotplug that can change the pathnames of the two programs introduced earlier in the section "User-Space Helpers."

A few files in /proc export the values within internal data structures and configuration parameters, which are useful to track what resources were allocated by device drivers, shown earlier in the section "Basic Goals of NIC Initialization." For some of these data structures, a user-space command is provided to print their contents in a more user-friendly format. For example, lsmod lists the modules currently loaded, using /proc/modules as its source of information.

/proc/net中,您可以找到由以下命令创建的文件 net_dev_init、viadev_proc_initdev_mcast_init(参见前面的部分“初始化设备处理层:net_dev_init ”):

In /proc/net, you can find the files created by net_dev_init, via dev_proc_init and dev_mcast_init (see the earlier section "Initializing the Device Handling Layer: net_dev_init"):

dev

Displays, for each network device registered with the kernel, a few statistics about reception and transmission, such as bytes received or transmitted, number of packets, errors, etc.

dev_mcast

Displays, for each network device registered with the kernel, the values of a few parameters used by IP multicast.

wireless

Similar to dev: for each wireless device, this file prints the values of a few parameters from the wireless block returned by the dev->get_wireless_stats virtual function. Note that dev->get_wireless_stats returns something only for wireless devices, because those allocate a data structure to keep those statistics (and so /proc/net/wireless will include only wireless devices).

softnet_stat

Exports statistics about the software interrupts used by the networking code. See Chapter 12.

Figure 5-5. /proc files related to the routing subsystem

There are other interesting directories, including /proc/drivers, /proc/bus, and /proc/irq, for which I refer you to Linux Device Drivers. In addition, kernel parameters are gradually being moved out of /proc and into a directory called /sys, but I won't describe the new system for lack of space.

Functions and Variables Featured in This Chapter

Table 5-1 summarizes the functions, macros, variables, and data structures introduced in this chapter.

Table 5-1. Functions, macros, variables, and data structures related to system initialization

Name

Description

Functions and macros

request_irq

free_irq

Registers and releases, respectively, a callback handler for an IRQ line. The registration can be exclusive or shared.

request_region

release_region

Allocates and releases I/O ports and I/O memory.

call_usermodehelper

Invokes a user-space helper application.

module_param

Macro used to define configuration parameters for modules.

net_dev_init

Initializes a piece of the networking code at boot time.

Global variables

dev_boot_phase

Boolean flag used by legacy code to enforce the execution of net_dev_init before NIC device drivers register themselves.

irq_desc

Pointer to the vector of IRQ descriptors.

Data structure

struct irq_action

Each IRQ line is defined by an instance of this structure. Among other fields, it includes a callback function.

net_device

Describes a network device.

Files and Directories Featured in This Chapter

Figure 5-6 lists the files and directories referred to in this chapter.

Figure 5-6. Files and directories featured in this chapter




[*] Chapter 11 describes this function in detail.

[] The hard_start_xmit virtual function is described in Chapter 11.

[*] The use of the macros BUG_ON and BUG_TRAP is a common mechanism to make sure necessary conditions are met at specific code points, and is useful when transitioning from one design to another.

[] See the section "Registering a PCI NIC Device Driver" in Chapter 6 for an example involving PCI.

[*] Note that because the device driver has not been loaded yet, eth0 does not exist yet either.

[*] The section "Example of PCI NIC Driver Registration" in Chapter 6 gives a brief description of a PCI device identifier.

[] The administrator can write his own scripts or use the ones provided by the most common Linux distributions.

[*] Unlike /sbin/hotplug, which is a shell script, /sbin/modprobe is a binary executable file. If you want to give it a look, download the source code of the modutil package.

[*] See Chapter 11 for more details on packet transmission in general, and dev_queue_xmit in particular.

[] Chapter 13 discusses the demultiplexing of ingress traffic based on the protocol identifier.

[] Chapter 4 defines notification chains and explains what kind of notifications they can be used for.

Chapter 6. The PCI Layer and Network Interface Cards

Given the popularity of the PCI bus, on the x86 as well as other architectures, we will spend a few pages on it so that you can understand how PCI devices are managed by the kernel, with special emphasis on network devices. This chapter will help you find a context for the code about device registration we will see in Chapter 8. You will also learn a bit about how PCI handles some nifty kernel features such as probing and power management. For an in-depth discussion of PCI, such as device driver design, PCI bus features, and implementation details, refer to Linux Device Drivers and Understanding the Linux Kernel, as well as PCI specifications.

The PCI subsystem (also known as the PCI layer ) in the kernel provides all the generic functions that are used in common by various PCI device drivers. This subsystem takes a lot of work off the shoulders of the programmer for each individual device, lets drivers be written in a clean manner, and makes it easier for the kernel to collect and maintain information about the devices, such as accounting information and statistics.

In this chapter, we will see the meaning of a few key data structures used by the PCI layer and how these structures are initialized by one common NIC device driver. I'll conclude with a few words on the PCI power management and Wake-on-LAN features.

Data Structures Featured in This Chapter

Here are a few key data structure types used by the PCI layer. There are many others, but the following ones are all we need to know for our overview in this book. The first one is defined in include/linux/mod_devicetable.h, and the other two are defined in include/linux/pci.h.

pci_device_id

Device identifier. This is not a local ID used by Linux, but an ID defined accordingly to the PCI standard. The later section "Registering a PCI NIC Device Driver" shows the ID's definition, and the later section "Example of PCI NIC Driver Registration" presents an example.

pci_dev

Each PCI device is assigned a pci_dev instance, just as network devices are assigned net_device instances. This is the structure used by the kernel to refer to a PCI device.

pci_driver

Defines the interface between the PCI layer and the device drivers. This structure consists mostly of function pointers. All PCI devices use it. See the later section "Example of PCI NIC Driver Registration" for its definition and an example of its initialization.

PCI device drivers are defined by an instance of a pci_driver structure. Here is a description of its main fields, with special attention paid to the case of NIC devices. The function pointers are initialized by the device driver to point to appropriate functions within that driver.

char *name

Name of the driver.

const struct pci_device_id *id_table

Vector of IDs the kernel will use to associate devices to this driver. The section "Example of PCI NIC Driver Registration" shows an example.

int (*probe)(struct pci_dev *dev, const struct pci_device_id *id)

Function invoked by the PCI layer when it finds a match between a device ID for which it is seeking a driver and the id_table mentioned previously. This function should enable the hardware, allocate the net_device structure, and initialize and register the new device.[*] In this function, the driver also allocates any additional data structures (e.g., buffer rings used during transmission or reception) that it may need to work properly.

void (*remove)(struct pci_dev *dev)

Function invoked by the PCI layer when the driver is unregistered from the kernel or when a hot-pluggable device is removed. It is the counterpart of probe and is used to clean up any data structure and state.

Network devices use this function to release the allocated I/O ports and I/O memory, to unregister the device, and to free the net_device data structure and any other auxiliary data structure that could have been allocated by the device driver, usually in its probe function.

int (*suspend)(struct pci_dev *dev, pm_message_t state)

int (*resume)(struct pci_dev *dev)

Functions invoked by the PCI layer when the system goes into suspend mode and when it is resumed, respectively. See the later section "Power Management and Wake-on-LAN."

int (*enable_wake)(struct pci_dev *dev, u32 state, int enable)

With this function, a driver can enable or disable the capability of the device to wake the system up by generating specific Power Management Event signals. See the later section "Power Management and Wake-on-LAN."

struct pci_dynids dynids

Dynamic IDs. See the following section.

See the later section "Example of PCI NIC Driver Registration" for an example of initialization of a pci_driver instance.

Registering a PCI NIC Device Driver

PCI devices are uniquely identified by a combination of parameters, including vendor, model, etc. These parameters are stored by the kernel in a data structure of type pci_device_id, defined as follows:

struct pci_device_id {
     unsigned int vendor, device;
     unsigned int subvendor, subdevice;
     unsigned int class, class_mask;
     unsigned long driver_data;
};

Most of the fields are self-explanatory. vendor and device are usually sufficient to identify the device. subvendor and subdevice are rarely needed and are usually set to a wildcard value (PCI_ANY_ID). class and class_mask represent the class the device belongs to; NETWORK is the class that covers the devices we discuss in this chapter. driver_data is not part of the PCI ID; it is a private parameter used by the driver.

Each device driver registers with the kernel a vector of pci_device_id instances that lists the IDs of the devices it can handle.

PCI device drivers register and unregister with the kernel with pci_register_driver and pci_unregister_driver, respectively. These functions are defined in drivers/pci/pci.c. There is also pci_module_init, an alias for pci_register_driver. A few drivers still use pci_module_init, which is the name of the routine the kernel provided in older kernel versions before the introduction of pci_register_driver.

pci_register_driver requires a pci_driver data structure as an argument. Thanks to the pci_driver's id_table vector, the kernel knows what devices the driver can handle, and thanks to all the virtual functions that are part of pci_driver, the kernel has a mechanism to interact with any device that will be associated with the driver.

One of the great advantages of PCI is its elegant support for probing to find the IRQ and other resources each device needs. A module can be passed input parameters at load time to tell it how to configure all the devices for which it is responsible, but sometimes (especially with buses such as PCI) it is easier to let the driver itself check the devices on the system and configure the ones for which it is responsible. The user can still fall back on manual configuration if necessary.

The /sys filesystem exports information about system buses (PCI, USB, etc.), including the various devices and relationships between them. /sys also allows an administrator to define new IDs for a given device driver so that besides the static IDs registered by the drivers with their pci_driver structures' id_table vector, the kernel can use the user-configured parameters.

We will not cover the probing mechanism used by the kernel to look up a driver based on the device IDs. However, it is worth mentioning that there are two types of probing:

Static

Given a device PCI ID, the kernel can look up the right PCI driver (i.e., the pci_driver instance) based on the id_table vectors. This is called static probing.

Dynamic

This is a lookup based on IDs the user configures manually, a rare practice but one that is occasionally useful, as for debugging. Dynamic refers to the system administrator's ability to add an ID; it does not mean the ID can change on its own.

Since dynamic IDs are configured on a running system, they are useful only when the kernel is compiled with support for Hotplug.

Power Management and Wake-on-LAN

PCI power management events are processed by the suspend and resume functions of the pci_driver data structure. Besides taking care of the PCI state, by saving and restoring it, respectively, these functions need to take special steps in the case of NICs:

  • suspend mainly stops the device egress queue so that no transmission will be allowed on the device.

  • resume re-enables the egress queue so that the device is available again for transmissions.

Wake-on-LAN (WOL) is a feature that allows an NIC to wake up a system that's in standby mode when it receives a specific type of frame. WOL is normally disabled by default. The feature can be turned on and off with pci_enable_wake.

When the WOL feature was first introduced, only one kind of frame could wake up a system: "Magic Packets."[*] These special frames have two main characteristics:

  • The destination MAC address belongs to the receiving NIC (whether the address is unicast, multicast, or broadcast).

  • Somewhere (anywhere) in the frame a sequence of 48 bits is set (i.e., FF:FF:FF:FF:FF:FF) followed by the NIC MAC address repeated at least 16 times in a row.
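
These two characteristics can be checked with a straightforward scan for the sync pattern followed by the repeated MAC. Below is a user-space sketch of that check; it is purely illustrative, since on real systems the NIC hardware recognizes the pattern itself:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if buf contains a Magic Packet pattern for the given MAC:
 * six 0xFF bytes followed by the 6-byte MAC repeated 16 times in a row,
 * anywhere in the buffer.  Illustrative user-space sketch only. */
static int is_magic_packet(const unsigned char *buf, size_t len,
                           const unsigned char mac[6])
{
    static const unsigned char sync[6] = {0xFF,0xFF,0xFF,0xFF,0xFF,0xFF};
    size_t need = 6 + 16 * 6;   /* sync sequence plus 16 MAC repetitions */
    size_t i;

    for (i = 0; i + need <= len; i++) {
        int r, ok;
        if (memcmp(buf + i, sync, 6) != 0)
            continue;
        ok = 1;
        for (r = 0; r < 16 && ok; r++)
            ok = memcmp(buf + i + 6 + (size_t)r * 6, mac, 6) == 0;
        if (ok)
            return 1;
    }
    return 0;
}
```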

Now it is possible to allow other frame types to wake up the system, too. A handful of devices can enable or disable the WOL feature based on a parameter that can be set at module load time (see drivers/net/3c59x.c for an example). The ethtool tool allows an administrator to configure what kind of frames can wake up the system. One choice is ARP packets, as described in the section "Wake-on-LAN Events" in Chapter 28. The net-utils package includes a command, ether-wake, that can be used to generate WOL Ethernet frames.

Whenever a WOL-enabled device recognizes a frame whose type is allowed to wake up the system, it generates a power management notification that does the job.

For more details on power management, refer to the later section "Interactions with Power Management" in Chapter 8.

Example of PCI NIC Driver Registration

Let's use the Intel PRO/100 Ethernet driver in drivers/net/e100.c to illustrate a driver registration:

#define INTEL_8255X_ETHERNET_DEVICE(device_id, ich) {\
     PCI_VENDOR_ID_INTEL, device_id, PCI_ANY_ID, PCI_ANY_ID, \
     PCI_CLASS_NETWORK_ETHERNET << 8, 0xFFFF00, ich }
static struct pci_device_id e100_id_table[] = {
     INTEL_8255X_ETHERNET_DEVICE(0x1029, 0),
     INTEL_8255X_ETHERNET_DEVICE(0x1030, 0),
     ...
};

We saw in the section "Registering a PCI NIC Device Driver" that a PCI NIC device driver registers with the kernel a vector of pci_device_id structures that lists the devices it can handle. e100_id_table is, for instance, the structure used by the e100.c driver. Note that:

  • The first field (which corresponds to vendor in the structure's definition) has the fixed value of PCI_VENDOR_ID_INTEL which is initialized to the vendor ID assigned to Intel.[*]

  • The third and fourth fields (subvendor and subdevice) are often initialized to the wildcard value PCI_ANY_ID, because the first two fields (vendor and device) are sufficient to identify the devices.

  • Many devices use the macro _ _devinitdata on the table of devices to mark it as initialization data, although e100_id_table does not. You will see in Chapter 7 exactly what that macro is used for.

The module is initialized by e100_init_module, as specified by the module_init macro.[*] When the function is executed by the kernel at boot time or at module loading time, it calls pci_module_init, the function introduced in the section "Registering a PCI NIC Device Driver." This function registers the driver, and, indirectly, all the associated NICs, as briefly described in the later section "The Big Picture."

The following snapshot shows the key parts of the e100 driver with regard to the PCI layer interface:

#define DRV_NAME "e100"

static int _ _devinit e100_probe(struct pci_dev *pdev,
     const struct pci_device_id *ent)
{
     ...
}
static void _ _devexit e100_remove(struct pci_dev *pdev)
{
     ...
}

#ifdef CONFIG_PM
static int e100_suspend(struct pci_dev *pdev, u32 state)
{
     ...
}
static int e100_resume(struct pci_dev *pdev)
{
     ...
}
#endif

static struct pci_driver e100_driver = {
     .name =         DRV_NAME,
     .id_table =     e100_id_table,
     .probe =        e100_probe,
     .remove =       _ _devexit_p(e100_remove),
#ifdef CONFIG_PM
     .suspend =      e100_suspend,
     .resume =       e100_resume,
#endif
};

static int _ _init e100_init_module(void)
{
     ...
     return pci_module_init(&e100_driver);
}

static void _ _exit e100_cleanup_module(void)
{
     pci_unregister_driver(&e100_driver);
}

module_init(e100_init_module);
module_exit(e100_cleanup_module);

Also note that:

  • suspendresume仅当内核支持电源管理时才会初始化 和 ,因此仅当该条件为真时,这两个例程和e100_suspend才会 e100_resume包含在映像中。

  • suspend and resume are initialized only when the kernel has support for power management, so the two routines e100_suspend and e100_resume are included in the image only when that condition is true.

  • The remove field of pci_driver is tagged with the _ _devexit_p macro, and e100_remove is tagged with _ _devexit.

  • e100_probe is tagged with _ _devinit.

You will see in Chapter 7 what the _ _dev XXX macros mentioned in the list are used for.

The Big Picture

Let's put together what we saw in the previous sections and see what happens at boot time in a system with a PCI bus and a few PCI devices.[*]

When the system boots, it creates a sort of database that associates each bus to a list of detected devices that use the bus. For example, the descriptor for the PCI bus includes, among other parameters, a list of detected PCI devices. As we saw in the section "Registering a PCI NIC Device Driver," each PCI device is uniquely identified by a large collection of fields in the structure pci_device_id, although only a few are usually necessary. We also saw how PCI device drivers define an instance of pci_driver and register with the PCI layer with pci_register_driver (or its alias, pci_module_init). By the time device drivers are loaded, the kernel has already built its database:[] let's then take the example of Figure 6-1(a) with three PCI devices and see what happens when device drivers A and B are loaded.

When device driver A is loaded, it registers with the PCI layer by calling pci_register_driver and providing its instance of pci_driver. The pci_driver structure includes a vector with the IDs of those PCI devices it can drive. The PCI layer then uses that table to see what devices match in its list of detected PCI devices. It thus creates the driver's device list shown in Figure 6-1(b). In addition, for each matching device, the PCI layer invokes the probe function provided by the matching driver in its pci_driver structure. The probe function creates and registers the associated network device. In this case, device Dev3 needs an additional device driver, called B. When driver B eventually registers with the kernel, Dev3 will be assigned to it. Figure 6-1(c) shows the results of loading the driver.
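
This binding step can be modeled in miniature. The toy sketch below uses plain integer IDs instead of full pci_device_id entries; the names toy_driver, toy_register_driver, and toy_probe are invented for illustration and are not kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the binding step: the "PCI layer" holds the IDs of the
 * detected devices; registering a driver walks that list and invokes the
 * driver's probe() for every device the driver claims in its ID table. */
struct toy_driver {
    const unsigned int *id_table;   /* zero-terminated list of claimed IDs */
    int (*probe)(unsigned int id);  /* returns 0 on successful binding */
};

static int toy_probe_calls;

static int toy_probe(unsigned int id)
{
    (void)id;
    toy_probe_calls++;              /* stands in for net device registration */
    return 0;
}

static int toy_register_driver(const struct toy_driver *drv,
                               const unsigned int *detected, size_t ndet)
{
    int bound = 0;
    size_t i;
    const unsigned int *id;

    for (i = 0; i < ndet; i++)
        for (id = drv->id_table; *id != 0; id++)
            if (*id == detected[i] && drv->probe(detected[i]) == 0)
                bound++;
    return bound;                   /* devices now associated with drv */
}
```

In the Figure 6-1 scenario, driver A would claim two of the three detected devices, and Dev3 would remain unbound until driver B registers.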

Figure 6-1. Binding between bus and drivers, and between driver and devices

When the driver is unloaded later, the module's module_exit routine invokes pci_unregister_driver. The PCI layer then, thanks to its database, goes through all the devices associated with the driver and invokes the driver's remove function. This function unregisters the network device.

You can find more details about the internals of the probe and remove functions in Chapter 8.

Tuning via /proc Filesystem

The /proc/pci file can be used to dump information about registered PCI devices. The lspci command, part of the pciutils package, can also be used to print useful information about the local PCI devices, but it retrieves its information from /sys.

Functions and Variables Featured in This Chapter

Table 6-1 summarizes the functions, macros, and data structures introduced in this chapter.

Table 6-1. Functions, macros, and data structures related to PCI device handling

Name

Description

Functions and macros

 

pci_register_driver

pci_unregister_driver

pci_module_init

Register, unregister, and initialize a PCI driver.

Data structure

 

struct pci_driver

struct pci_device_id

struct pci_dev

The first data structure defines a PCI driver (and consists mostly of virtual function callbacks). The second stores the universal ID associated with a PCI device. The last one represents a PCI device in kernel space.

Files and Directories Featured in This Chapter

Figure 6-2 lists the files and directories referred to in the chapter. The figure does not include all the files used by the topics covered in the chapter. For example, the drivers/pci/ directory includes several other files.

Figure 6-2. Files and directories featured in this chapter




[*] NIC registration is covered in Chapter 8.

[*] WOL was introduced by AMD with the name "Magic Packet Technology."

[*] You can find an updated list at http://pciids.sourceforge.net.

[*] See Chapter 7 for more details on module initialization code.

[*] Other buses behave in a similar way. Please refer to Linux Device Drivers for details.

[] This may not be possible for all bus types.

Chapter 7. Kernel Infrastructure for Component Initialization

To fully understand a kernel component, you have to know not only what a given set of routines does, but also when those routines are invoked and by whom. The initialization of a subsystem is one of the basic tasks handled by the kernel according to its own model. This infrastructure is worth studying to help you understand how core components of the networking stack are initialized, including NIC device drivers.

The purpose of this chapter is to show how the kernel handles routines used to initialize kernel components, both for components statically included into the kernel and those loaded as kernel modules, with a special emphasis on network devices. We will therefore see:

  • How initialization functions are named and identified by special macros

  • How these macros are defined, based on the kernel configuration, to optimize memory usage and make sure that the various initializations are done in the correct order

  • When and how the functions are executed

We will not cover all details of the initialization infrastructure, but you'll have a sufficient overview to navigate the source code comfortably.

Boot-Time Kernel Options

Linux allows users to pass kernel configuration options to their boot loaders, which then pass the options to the kernel; experienced users can use this mechanism to fine-tune the kernel at boot time.[*] During the boot phase, as shown in Figure 5-1 in Chapter 5, the two calls to parse_args take care of the boot-time configuration input. We will see in the next section why parse_args is called twice, with details in the later section "Two-Pass Parsing."

parse_args is a routine that parses an input string with parameters in the form name_variable=value, looking for specific keywords and invoking the right handlers. parse_args is also used when loading a module, to parse the command-line parameters provided (if any).
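
The core of such a dispatcher is easy to sketch in user space: split a keyword=value token and hand the value to the registered handler. The names handler_entry, record_handler, and dispatch_param below are invented for illustration; registered keywords include the trailing = character, as with the real _ _setup keywords:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of keyword dispatch in the style of parse_args: match a
 * "keyword=value" token against a table and invoke the handler on
 * the text that follows the '='.  Simplified user-space model. */
struct handler_entry {
    const char *keyword;               /* includes the trailing '=' */
    int (*handler)(const char *value);
};

static char last_value[64];

static int record_handler(const char *value)
{
    strncpy(last_value, value, sizeof(last_value) - 1);
    last_value[sizeof(last_value) - 1] = '\0';
    return 0;
}

static int dispatch_param(const char *token,
                          const struct handler_entry *table, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        size_t klen = strlen(table[i].keyword);
        if (strncmp(token, table[i].keyword, klen) == 0)
            return table[i].handler(token + klen);
    }
    return -1;                          /* no registered keyword matched */
}
```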

We do not need to know the details of how parse_args implements the parsing, but it is interesting to see how a kernel component can register a handler for a keyword and how the handler is invoked. To have a clear picture we need to learn:

  • How a kernel component can register a keyword, along with the associated handler that will be executed when that keyword is provided with the boot string.

  • How the kernel resolves the association between keywords and handlers. I will offer a high-level overview of how the kernel parses the input string.

  • How the networking device subsystem uses this feature.

All the parsing code is in kernel/params.c. We'll cover the points in the list one by one.

Registering a Keyword

Kernel components can register a keyword and the associated handler with the _ _setup macro, defined in include/linux/init.h. This is its syntax:

_ _setup(string, function_handler)

where string is the keyword and function_handler is the associated handler. The example just shown instructs the kernel to execute function_handler when the input boot-time string includes string. string has to end with the = character to make the parsing easier for parse_args. Any text following the = will be passed as input to function_handler.

The following is an example from net/core/dev.c, where netdev_boot_setup is registered as the handler for the netdev= keyword:

_ _setup("netdev=", netdev_boot_setup);

The same handler can be associated with different keywords. For instance net/ethernet/eth.c registers the same handler, netdev_boot_setup, for the ether= keyword.

When a piece of code is compiled as a module, the _ _setup macro is ignored (i.e., defined as a no-op). You can check how the definition of the _ _setup macro changes in include/linux/init.h depending on whether the code that includes the latter file is a module.

The reason why start_kernel calls parse_args twice to parse the boot configuration string is that boot-time options are actually divided into two classes, and each call takes care of one class:

Default options

Most options fall into this category. These options are defined with the _ _setup macro and are handled by the second call to parse_args.

Early options

Some options need to be handled earlier than others during the kernel boot. The kernel provides the early_param macro to declare these options instead of _ _setup. They are then taken care of by parse_early_params. The only difference between early_param and _ _setup is that the former sets a special flag so that the kernel will be able to distinguish between the two cases. The flag is part of the obs_kernel_param data structure that we will see in the section ".init.setup Memory Section."

The handling of boot-time options has changed with the 2.6 kernel, but not all the kernel code has been updated accordingly. Before the latest changes, there used to be only the _ _setup macro. Because of this, legacy code that is to be updated now uses the macro _ _obsolete_setup. When the user passes the kernel an option that is declared with the _ _obsolete_setup macro, the kernel prints a message warning about its obsolete status and provides a pointer to the file and source code line where the latter is declared.

Figure 7-1 summarizes the relationship between the various macros: all of them are wrappers around the generic routine _ _setup_param.

Note that the input routine passed to _ _setup is placed into the .init.setup memory section. The effect of this action will become clear in the section "Boot-Time Initialization Routines."

Figure 7-1. setup_param macro and its wrappers

Two-Pass Parsing

Because boot-time options used to be handled differently in previous kernel versions, and not all of them have been converted to the new model, the kernel handles both models. When the new infrastructure fails to recognize a keyword, it asks the obsolete infrastructure to handle it. If the obsolete infrastructure also fails, the keyword and value are passed on to the init process that will be invoked at the end of the init kernel thread via run_init_process (shown in Figure 5-1 in Chapter 5). The keyword and value are added either to the arg parameter list or to the envp environment variable list.

The previous section explained that, to allow early options to be handled in the necessary order, boot-string parsing and handler invocation are handled in two passes, shown in Figure 7-2 (the figure shows a snapshot from start_kernel, introduced in Chapter 5):

  1. The first pass looks only for higher-priority options that must be handled early, which are identified by a special flag (early).

  2. The second pass takes care of all other options. Most of the options fall into this category. All options following the obsolete model are handled in this pass.

The second pass first checks whether there is a match with the options implemented according to the new infrastructure. These options are stored in kernel_param data structures, filled in by the module_param macro introduced in the section "Module Options" in Chapter 5. The same macro makes sure that all of those data structures are placed into a specific memory section (_ _param), delimited by the pointers _ _ start_ _ _param and _ _stop_ _ _param.

When one of these options is recognized, the associated parameter is initialized to the value provided with the boot string. When there is no match for an option, unknown_bootoption tries to see whether the option should be handled by the obsolete model handler (Figure 7-2).

Figure 7-2. Two-pass option parsing

Obsolete and new model options are placed into two different memory areas:

_ _setup_start ... _ _setup_end

We will see in a later section that this area is freed at the end of the boot phase: once the kernel has booted, these options are not needed anymore. The user cannot view or change them at runtime.

_ _ start_ _ _param ... _ _ stop_ _ _param

This area is not freed. Its content is exported to /sys, where the options are exposed to the user.

See Chapter 5 for more details on module parameters.

Also note that all obsolete model options, regardless of whether they have the early flag set, are placed into the _ _setup_start ... _ _setup_end memory area.

.init.setup Memory Section

The two inputs to the _ _setup macro we introduced in the previous section are placed into a data structure of type obs_kernel_param, defined in include/linux/init.h:

struct obs_kernel_param {
    const char *str;
    int (*setup_func)(char*);
    int early;
};

str is the keyword, setup_func is the handler, and early is the flag we introduced in the section "Two-Pass Parsing."
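
The role of the early flag can be illustrated with a two-pass sweep over a table of such entries. This is a simplified user-space model of the effect achieved by the two parse_args calls, not the kernel's actual code; the field names follow obs_kernel_param, while run_pass and log_opt are invented for illustration:

```c
#include <assert.h>

/* Simplified model of two-pass option handling: the first sweep runs only
 * the handlers whose entry has early set, the second sweep runs the rest. */
struct opt {
    const char *str;
    int (*setup_func)(const char *);
    int early;
};

static char call_order[8];
static int ncalls;

static int log_opt(const char *s)
{
    call_order[ncalls++] = s[0];   /* record which option ran, in order */
    return 0;
}

static void run_pass(const struct opt *opts, int n, int early_pass)
{
    int i;

    for (i = 0; i < n; i++)
        if (opts[i].early == early_pass)
            opts[i].setup_func(opts[i].str);
}
```

Running the early pass first means an option flagged early executes before options that were registered ahead of it in the table, which is the whole point of the flag.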

The _ _setup_param macro places all of the obs_kernel_params instances into a dedicated memory area. This is done mainly for two reasons:

  • It is easier to walk through all of the instances—for instance, when doing a lookup based on the str keyword. We will see how the kernel uses the two pointers _ _setup_start and _ _setup_end, that point respectively to the start and end of the previously mentioned area (as shown later in Figure 7-3), when doing a keyword lookup.

  • The kernel can quickly free all of the data structures when they are not needed anymore. We will go back to this point in the section "Memory Optimizations."

Use of Boot Options to Configure Network Devices

In light of what we saw in the previous sections, let's see how the networking code uses boot options.

We already mentioned in the section "Registering a Keyword" that both the ether= and netdev= keywords are registered to use the same handler, netdev_boot_setup. When this handler is invoked to process the input parameters (i.e., the string that follows the matching keyword), it stores the result into data structures of type netdev_boot_setup, defined in include/linux/netdevice.h. The handler and the data structure type happen to share the same name, so make sure you do not confuse the two.

struct netdev_boot_setup {
    char name[IFNAMSIZ];
    struct ifmap map;
};

name is the device's name, and ifmap, defined in include/linux/if.h, is the data structure that stores the input configuration:

struct ifmap
{
    unsigned long mem_start;
    unsigned long mem_end;
    unsigned short base_addr;
    unsigned char irq;
    unsigned char dma;
    unsigned char port;
    /* 3 bytes spare */
};

The same keyword can be provided multiple times (for different devices) in the boot-time string, as in the following example:

LILO: linux ether=5,0x260,eth0 ether=15,0x300,eth1

However, the maximum number of devices that can be configured at boot time with this mechanism is NETDEV_BOOT_SETUP_MAX, which is also the size of the static array dev_boot_setup used to store the configurations:

static struct netdev_boot_setup dev_boot_setup[NETDEV_BOOT_SETUP_MAX];

netdev_boot_setup is pretty simple: it extracts the input parameters from the string, fills in an ifmap structure, and adds the latter to the dev_boot_setup array with netdev_boot_setup_add.
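
A minimal user-space sketch of that extraction step might look as follows, handling just the "irq,base_addr,name" form of the LILO example above. boot_cfg and parse_boot_cfg are invented names; the real handler in net/core/dev.c also accepts the mem_start and mem_end fields and fills in a struct ifmap:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse a parameter string such as "5,0x260,eth0" into its components. */
struct boot_cfg {
    int irq;
    unsigned long base_addr;
    char name[16];                       /* IFNAMSIZ in the kernel */
};

static int parse_boot_cfg(const char *s, struct boot_cfg *cfg)
{
    char *end;

    cfg->irq = (int)strtol(s, &end, 0);  /* base 0 accepts 0x prefixes */
    if (*end != ',')
        return -1;
    cfg->base_addr = strtoul(end + 1, &end, 0);
    if (*end != ',')
        return -1;
    strncpy(cfg->name, end + 1, sizeof(cfg->name) - 1);
    cfg->name[sizeof(cfg->name) - 1] = '\0';
    return 0;
}
```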

At the end of the booting phase, the networking code can use the netdev_boot_setup_check function to check whether a given interface is associated with a boot-time configuration. The lookup on the array dev_boot_setup is based on the device name dev->name:

int netdev_boot_setup_check(struct net_device *dev)
{
    struct netdev_boot_setup *s = dev_boot_setup;
    int i;

    for (i = 0; i < NETDEV_BOOT_SETUP_MAX; i++) {
        if (s[i].name[0] != '\0' && s[i].name[0] != ' ' &&
            !strncmp(dev->name, s[i].name, strlen(s[i].name))) {
            dev->irq        = s[i].map.irq;
            dev->base_addr  = s[i].map.base_addr;
            dev->mem_start  = s[i].map.mem_start;
            dev->mem_end    = s[i].map.mem_end;
            return 1;
        }
    }
    return 0;
}

Devices with special capabilities, features, or limitations can define their own keywords and handlers if they need additional parameters on top of the basic ones provided by ether= and netdev= (one driver that does this is PLIP).

Module Initialization Code

Because the examples in the following sections often refer to modules, a couple of initial concepts have to be made clear.

Kernel code can be either statically linked to the main image or loaded dynamically as a module when needed. Not all kernel components are suitable to be compiled as modules. Device drivers and extensions to basic functionalities are good examples of kernel components often compiled as modules. You can refer to Linux Device Drivers for a detailed discussion of the advantages and disadvantages of modules, as well as the mechanisms that the kernel can use to dynamically load them when they are needed and unload them when they are no longer needed.

Every module must provide two special functions, called init_module and cleanup_module. The first one is called at module load time to initialize the module. The second one is invoked by the kernel when removing the module, to release any resources (memory included) that have been allocated for use by the module.

The kernel provides two macros, module_init and module_exit, that allow developers to use arbitrary names for the two routines. The following snapshot is an example from the drivers/net/3c59x.c Ethernet driver:

module_init(vortex_init);
module_exit(vortex_cleanup);

In the section "Memory Optimizations," we will see how those two macros are defined and how their definition can change based on the kernel configuration. Most of the kernel uses these two macros, but a few modules still use the old default names init_module and cleanup_module. In the rest of this chapter, I will use module_init and module_exit to refer to the initialization and cleanup functions.

Let's first see how module initialization code used to be written with older kernels, and then how the current kernel model, based on a set of new macros, works.

Old Model: Conditional Code

Regardless of whether a kernel component is compiled as a module or is built statically into the kernel, it needs to be initialized. Because of that, the initialization code of a kernel component may need to distinguish between the two cases by means of conditional directives to the compiler. In the old model, this forced developers to use conditional directives like #ifdef all over the place.

Here is a snapshot from the drivers/net/3c59x.c driver of kernel 2.2.14: note how many times #ifdef MODULE and #if defined (MODULE) are used.

...
#if defined(MODULE) && LINUX_VERSION_CODE > 0x20115
MODULE_AUTHOR("Donald Becker <becker@cesdis.gsfc.nasa.gov>");
MODULE_DESCRIPTION("3Com 3c590/3c900 series Vortex/Boomerang driver");
MODULE_PARM(debug, "i");
...
#endif
...
#ifdef MODULE
...
int init_module(void)
{
    ...
}
#else
int tc59x_probe(struct device *dev)
{
    ...
}
#endif  /* not MODULE */
...
static int vortex_scan(struct device *dev, struct pci_id_info pci_tbl[])
{
    ...
#if defined(CONFIG_PCI) || (defined(MODULE) && !defined(NO_PCI))
    ...
#ifdef MODULE
    if (compaq_ioaddr) {
        vortex_probe1(0, 0, dev, compaq_ioaddr, compaq_irq,
                compaq_device_id, cards_found++);
        dev = 0;
    }
#endif

    return cards_found ? 0 : -ENODEV;
}
...
#ifdef MODULE
void cleanup_module(void)
{
    ... ... ...
}
#endif

This snapshot shows how the old model let a programmer specify some of the things done differently, depending on whether the code is compiled as a module or statically into the kernel image:

The initialization code is executed differently

The snapshot shows that the cleanup_module routine is defined (and therefore used) only when the driver is compiled as a module.

Pieces of code could be included or excluded from the module

For example, vortex_scan calls vortex_probe1 only when the driver is compiled as a module.

This model made source code harder to follow, and therefore to debug. Moreover, the same logic is repeated in every module.

New Model: Macro-Based Tagging

Now let's compare the snapshot from the previous section to its counterpart from the same file from a 2.6 kernel:

static char version[] _ _devinitdata = DRV_NAME " ... ";

static struct vortex_chip_info {
    ...
} vortex_info_tbl[] _ _devinitdata = {
    {"3c590 Vortex 10Mbps",
    ... ... ...
}

static int _ _init vortex_init (void)
{
    ...
}
static void _ _exit vortex_cleanup (void)
{
    ...
}

module_init(vortex_init);
module_exit(vortex_cleanup);

You can see that #ifdef directives are no longer necessary.

To remove the mess introduced by conditional code, and therefore make code more readable, kernel developers introduced a set of macros that module developers now can use to write cleaner initialization code (most drivers are good candidates for the use of those macros). The snapshot just shown uses a few of them: _ _init, _ _exit, and _ _devinitdata.

Later sections describe how some of the new macros are used and how they work.

These macros allow the kernel to determine behind the scenes, for each module, what code is to be included in the kernel image, what code is to be excluded because it is not needed, what code is to be executed only at initialization time, etc. This removes the burden from each programmer to replicate the same logic in each module.[*]

It should be clear that for these macros to allow programmers to replace the old conditional directives, as shown in the example of the previous section, they must be able to provide at least the following two services:

  • Define routines that need to be executed when a new kernel component is enabled, either because it is statically included in the kernel or because it is loaded at runtime as a module

  • Define some kind of order between initialization functions so that dependencies between kernel components can be respected and enforced

Optimized Macro-Based Tagging

The Linux kernel uses a variety of different macros to mark functions and data structures with special properties: for instance, to mark an initialization routine. Most of those macros are defined in include/linux/init.h. Some of those macros tell the linker to place code or data structures with common properties into specific, dedicated memory areas (memory sections) as well. By doing so, it becomes easier for the kernel to take care of an entire class of objects (routines or data structures) with a common property in a simple manner. We will see an example in the section "Memory Optimizations."

Figure 7-3 shows some of the kernel memory sections: on the left side are the names of the pointers that delimit the beginning and the end of each area section (when meaningful).

Figure 7-3. Some of the memory sections used by initialization code

On the right side are the names of the macros used to place data and code into the associated sections. The figure does not include all the memory sections and associated macros; there are too many to list conveniently.

Tables 7-1 and 7-2 list some of the macros used to tag routines and data structures, respectively, along with a brief description. We will not look at all of them for lack of space, but we will spend a few words on the xxx _initcall macros in the section "xxx_initcall Macros" and on _ _init and _ _exit in the section "_ _init and _ _exit Macros."

The purpose of this section is not to describe how the kernel image is built, how modules are handled, etc., but rather to give you just a few hints about why those macros exist, and how the ones most commonly used by device drivers work.

Table 7-1. Macros for routines

Macro

Kind of routines the macro is used for

a. _ _exitcall and _ _initcall are defined on top of _ _exit_call and _ _init_call.

_ _init

Boot-time initialization routine: for routines that are not needed anymore at the end of the boot phase.

This information can be used to get rid of the routine under some conditions (see the later section "Memory Optimizations").

_ _exit

Counterpart to _ _init. Called when the associated kernel component is shut down. Often used to mark module_exit functions.

This information can be used to get rid of the routine under some conditions (see the later section "Memory Optimizations").

core_initcall

postcore_initcall

arch_initcall

subsys_initcall

fs_initcall

device_initcall

late_initcall

Set of macros, listed in decreasing order of priority, used to tag initialization routines that need to be executed at boot time. See the later section "xxx_initcall Macros."

_ _initcall

Obsolete macro, defined as an alias to device_initcall. See the later section "Legacy code."

_ _exitcall a

One-shot exit function, called when the associated kernel component is shut down. So far, it has been used only to mark module_exit routines. See the later section "Memory Optimizations."

Table 7-2. Macros for initialized data structures

Macro

Kind of data the macro is used for

_ _initdata

Initialized data structure used at boot time only.

_ _exitdata

Data structure used only by routines tagged with _ _exitcall. It follows that if a routine tagged with _ _exitcall is not going to be used, the same is true of data tagged with _ _exitdata. The same kind of optimization can therefore be applied to _ _exitdata and _ _exitcall.

Before we go into some more detail on a few of the macros listed in Tables 7-1 and 7-2, it is worth stressing the following points:

  • Most macros come in couples: one (or a set of them) takes care of initialization, and a sister macro (or a sister set) takes care of removal. For example, _ _exit is _ _init's sister; _ _exitcalls is _ _initcall's sister, etc.

  • Macros take care of two points (one or the other, not both): one is when a routine is to be executed (i.e., _ _initcall, _ _exitcall); the other is the memory section a routine or a data structure is to be placed in (i.e., _ _init, _ _exit).

  • The same routine can be tagged with more than one macro. For example, the following snapshot says that pci_proc_init is to be run at boot time (_ _initcall), and can be freed once it is executed (_ _init):

    static int _ _init pci_proc_init(void)
    {
    ...
    }
    
    _ _initcall(pci_proc_init)

Initialization Macros for Device Initialization Routines

Table 7-3 lists a set of macros commonly used to tag routines used by device drivers to initialize their devices, and that can introduce memory optimizations when the kernel does not have support for Hotplug. In the section "Example of PCI NIC Driver Registration" in Chapter 6, you can find an example of their use. In the later section "Other Optimizations," you can see when the macros in Table 7-3 facilitate memory optimizations.

Table 7-3. Macros for device initialization routines

Name

Description

_ _devinit

Used to tag routines that initialize a device.

For instance, for a PCI driver, the routine to which pci_driver->probe is initialized is tagged with this macro.

Routines that are exclusively invoked by another routine tagged with _ _devinit are commonly tagged with _ _devinit as well.

_ _devexit

Used to tag routines to be invoked when a device is removed.

_ _devexit_p

Used to initialize pointers to routines tagged with _ _devexit.

_ _devexit_p(fn) returns fn if the kernel supports both modules and Hotplug, and returns NULL otherwise. See the later section "Other Optimizations."

_ _devinitdata

Used to tag initialized data structures that are used by functions that take care of device initialization (i.e., are tagged with _ _devinit), and that therefore share their properties .

_ _devexitdata

Same as _ _devinitdata but associated with _ _devexit.

Boot-Time Initialization Routines

Most initialization routines have two interesting properties:

  • They need to be executed at boot time, when all the kernel components get initialized.

  • They are not needed once they are executed.

The next section, "xxx_initcall Macros," describes the mechanism used to run initialization routines at boot time, taking into consideration these properties as well as priorities among modules. The later section "Memory Optimizations" shows how routines and data structures that are no longer needed can be freed at link time or runtime by using smart tagging.

xxx_initcall Macros

The early phase of the kernel boot consists of two main blocks of initializations:

  • The initialization of various critical and mandatory subsystems that need to be done in a specific order. For instance, the kernel cannot initialize a PCI device driver before the PCI layer has been initialized. See the later section "Example of dependency between initialization routines" for another example.

  • The initialization of other kernel components that do not need a strict order: routines in the same priority level can be run in any order.

The first part is taken care of by the code that comes before do_initcalls in Figure 5-1 in Chapter 5. The second part is taken care of by the invocation of do_initcalls shown close to the end of do_basic_setup in the same figure. The initialization routines of this second part are classified based on their role and priority. The kernel executes those initialization routines one by one, starting from the ones placed in the highest-priority class (core_initcall). The addresses of those routines, which are needed to invoke them, are placed in the .initcall N .init memory sections of Figure 7-3 by tagging them with one of the xxx _initcall macros in Table 7-1.

The area used to store the addresses of the routines tagged with the xxx _initcall macros is delimited by a starting address (_ _initcall_start) and an ending address (_ _initcall_end). In the excerpt of the do_initcalls function that follows, you can see that it simply takes the function addresses one by one from that area and executes the functions they point to:

static void _ _init do_initcalls(void)
{
        initcall_t *call;
        int count = preempt_count( );

        for (call = _ _initcall_start; call < _ _initcall_end; call++) {
            ... ... ...
            (*call)( );
            ... ... ...
        }
        flush_scheduled_work( );
}

The routines invoked by do_initcalls are not supposed to change the preemption status or disable IRQs. Because of that, after each routine execution, do_initcalls checks whether the routine has made any changes, and adjusts the preemption and IRQ status if necessary (not shown in the previous snapshot).

It is possible for the xxx _initcall routines to schedule some work that takes place later. This means that the tasks handled by those routines may terminate asynchronously, at unknown times. The call to flush_scheduled_work is used to make do_initcalls wait for those asynchronous tasks to complete before returning.

Note that do_initcalls itself is marked with _ _init: because it is used only once within do_basic_setup during the booting phase, the kernel can discard it once the latter is done.

_ _exitcall is the counterpart of _ _initcall. It is not used much directly, but rather via other macros defined as aliases to it, such as module_exit, which we introduced in the section "Module Initialization Code."

Example of _ _initcall and _ _exitcall routines: modules

I said in the section "Module Initialization Code" that the module_init and module_exit macros, respectively, are used to tag routines to be executed when the module is initialized (either at boot time if built into the kernel or at runtime if loaded separately) and unloaded.

This makes a module the perfect candidate for our _ _initcall and _ _exitcall macros: in light of what I just said, the following definition from include/linux/init.h of the macros module_init and module_exit should not come as a surprise:

#ifndef MODULE
... ... ...
#define module_init(x)    _ _initcall(x);
#define module_exit(x)    _ _exitcall(x);

#else
... ... ...
#endif

module_init is defined as an alias to _ _initcall for code statically linked to the kernel: its input function is classified as a boot-time initialization routine.

module_exit follows the same scheme: when the code is built into the kernel, module_exit becomes a shutdown routine. At the moment, shutdown routines are not called when the system goes down, but the code is in place to allow it.[*]

Example of dependency between initialization routines

net_dev_init was introduced in Chapter 5. Device drivers register with the kernel with their module_init routine, which, as described in the section "The Big Picture" in Chapter 6, registers its devices with the networking code. Both net_dev_init and the various module_init functions for built-in drivers are invoked at boot time by do_initcalls. Because of that, the kernel needs to make sure no device registrations take place before net_dev_init has been executed. This is enforced transparently thanks to the marking of device driver initialization routines with the macro device_initcall (or its alias, _ _initcall), while net_dev_init is marked with subsys_initcall. In Figure 7-3, you can see that subsys_initcall routines are executed earlier than device_initcall routines (the memory sections are sorted in priority order).

Legacy code

Before the introduction of the set of xxx_initcall macros, there was only one macro to mark initialization functions: _ _initcall. The use of only a single macro created a heavy limitation: no execution order could be enforced by simply marking routines with the macro. In many cases, this limitation is not acceptable due to intermodule dependencies, and other considerations. Therefore, the use of _ _initcall could not be extended to all the initialization functions.

_ _initcall used to be employed mostly by device drivers. For backward compatibility with pieces of code not yet updated to the new model, it still exists and is simply defined as an alias to device_initcall.

Another limitation, which is still present in the current model, is that no parameters can be provided to the initialization routines. However, this does not seem to be an important limitation.

Memory Optimizations

Unlike user-space code and data, kernel code and data reside permanently in main memory, so it is important to reduce memory waste in every way possible. Initialization code is a good candidate for memory optimization . Given their nature, most initialization routines are executed either just once or not at all, depending on the kernel configuration. For example:

  • The module_init routines are executed only once when the associated module is loaded. When the module is statically included in the kernel, the kernel can free the module_init routine right at boot time, after it runs.

  • The module_exit routines are never executed when the associated modules are included statically in the kernel. In this case, therefore, there is no need to include module_exit routines into the kernel image (i.e., the routines can be discarded at link time).

The first case is a runtime optimization, and the second one is a link-time optimization.

Code and data that are used only during the boot and are not needed thereafter are placed in one of the memory sections shown in Figure 7-3. Once the kernel has completed the initialization phase, it can discard that entire memory area. This is accomplished by the call to free_initmem,[*] as shown in Figure 5-1 in Chapter 5. Different macros are used to place code into the different memory sections of Figure 7-3.

If you look at the example in the earlier section "New Model: Macro-Based Tagging," you can see that the two input routines to module_init and module_exit are (usually) tagged with _ _init and _ _exit, respectively: this is done precisely to take advantage of the two properties mentioned at the start of this section.

_ _init and _ _exit Macros

The initialization routines executed in the early phase of the kernel are tagged with the macro _ _init.

As mentioned in the previous section, most module_init input routines are tagged with this macro. For example, most of the functions in Figure 5-1 in Chapter 5 (before the call to free_initmem) are marked with _ _init.

As shown by its definition here, the _ _init macro places the input routine into the .text.init memory section:

#define _ _init    _ _attribute_ _ ((_ _section_ _ (".text.init")))

This section is one of the memory areas freed at runtime by free_initmem.

_ _exit is the counterpart of _ _init. Routines used to shut down a module are placed into the .text.exit section. This section can be discarded at link time directly for modules built into the kernel. However, a few architectures discard it at runtime to deal with cross-references. Note that for modules loaded separately, the same section can be removed at load time when the kernel does not support module unloading. (There is a kernel option that keeps the user from unloading modules.)

xxx_initcall and _ _exitcall Sections

The memory sections where the kernel places the addresses to the routines tagged with the xxx _initcall and _ _exitcall macros are also discarded:

  • The xxx _initcall sections shown in Figure 7-3 are discarded at runtime by free_initmem.

  • The .text.exit section used for _ _exitcall functions is discarded at link time because right now the kernel does not call the _ _exitcall routines on system shutdown (i.e., it does not use a mechanism similar to do_initcalls).

Other Optimizations

Other examples of optimizations include the macros in Table 7-3:

_ _devinit

When the kernel is not compiled with support for Hotplug, routines tagged with _ _devinit are not needed anymore at the end of the boot phase (after all the devices have been initialized). Because of this, when there is no support for Hotplug, _ _devinit becomes an alias to _ _init.

_ _devexit

When a PCI driver is built into a kernel without support for Hotplug, the routine to which pci_driver->remove is initialized, and which is tagged with _ _devexit, can be discarded because it is not needed. The routine can be discarded also when the module is loaded separately into a kernel that does not have support for module unloading.

_ _devinitdata

When there is no support for Hotplug, this data too is needed only at boot time. Normally, device drivers use this macro to tag the banner strings that the pci_driver-> probe functions print when initializing a device. PCI drivers, for instance, tag the pci_device_id tables with _ _devinitdata: once the system has finished booting and there is no support for Hotplug, the kernel does not need the tables anymore.

This section has given you only a few examples of removing code. You can learn more by browsing the source code, starting, for instance, from the architecture-dependent definitions of the /DISCARD/ section.

Dynamic Macros' Definition

In the previous sections, I introduced a few macros, such as _ _init and the various versions of xxx _initcall. We have also seen that the routines passed to the module_init macro are tagged with macros such as _ _initcall. Because most kernel components can be either compiled as modules or statically linked to the kernel, the choice made changes the definitions of these macros to apply the memory optimizations introduced in the previous section.

In particular, the definitions of the macros in Table 7-1, as you can see in include/linux/init.h, change depending on whether the following symbols are defined within the scope of the file that includes include/linux/init.h:

CONFIG_MODULE

Defined when there is support for loadable modules in the kernel (the "Loadable module support" configuration option)

MODULE

Defined when the kernel component that the file belongs to is compiled as a module

CONFIG_HOTPLUG

Defined when the kernel is compiled with "Support for hot-pluggable devices" (an option in the "General setup" menu)

While MODULE can have different values for different files, the other two symbols are kernel-wide properties and therefore are either set or not set consistently throughout a kernel.
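
As a rough illustration, the switch can be modeled in userspace like this. This is a hypothetical sketch: the real include/linux/init.h records addresses in ELF .initcall sections, which a GCC constructor attribute and a plain array stand in for here:

```c
/* Minimal model of module_init's two personalities. */
typedef int (*initcall_t)(void);

static initcall_t initcalls[8];  /* stand-in for the xxx_initcall sections */
static int n_initcalls;

/* Built-in case: record fn so that a do_initcalls-style loop can run
 * it at boot. */
#define __initcall(fn)                           \
    __attribute__((constructor))                 \
    static void register_##fn(void)              \
    {                                            \
        initcalls[n_initcalls++] = fn;           \
    }

#ifdef MODULE
/* Module case: fn becomes the module's init_module entry point,
 * executed at load time instead. */
#define module_init(fn) int init_module(void) { return fn(); }
#else
#define module_init(fn) __initcall(fn)
#endif

static int sample_init(void) { return 0; }
module_init(sample_init)
```

Built without MODULE defined, sample_init ends up in the initcalls table; built with MODULE, it would instead be exported as init_module.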

Among the macros in Tables 7-1 and 7-2, we are mostly interested in the following ones from the perspective of NIC driver initialization: _ _init, _ _exit, _ _initcall, and _ _exitcall. Summarizing what was discussed so far, Figure 7-4 shows the effectiveness of the macros in the previous list in saving memory, based on whether the symbols MODULE and CONFIG_HOTPLUG are defined (let's suppose the kernel had support for loadable modules—i.e., that CONFIG_MODULE is defined). As you can see from the figure, there is a lot going on when the kernel does not have support for both loadable modules and Hotplug, compared to when both of those options are supported: the more restrictions you have, the more optimizations you get.

Figure 7-4. Effect of macros in Table 7-1, following numbered lists in text

Let's see one by one the meaning of the points 1 through 6 in Figure 7-4, keeping in mind the generic structure of a device driver as shown earlier in the section "New Model: Macro-Based Tagging" and the definitions of _ _initcall and _ _exitcall that we saw earlier in the section "Memory Optimizations."

Here are the optimizations that can be applied when compiling a module as part of the kernel:

  1. module_exit routines will never be used; so by tagging them with _ _exit, the programmer makes sure they will not be included in the image at link time.

  2. module_init routines will be executed only once at boot time, so by tagging them with _ _init, the programmer lets them be discarded once they are executed.

  3. module_init(fn) becomes an alias to _ _initcall(fn), which makes sure fn will be executed by do_initcalls, as we saw in the section "xxx_initcall Macros."

  4. module_exit(fn) becomes an alias to _ _exitcall(fn). This places the address to the input function in the .exitcall.exit memory section, which makes it easier for the kernel to run it at shutdown time, but the section is actually discarded at link time.

    Let's use PCI devices as a reference, and see what other optimizations the lack of support for Hotplug introduces. These concern the pci_driver->remove function, which is called when a module is unloaded, once for each device registered by that module (see the section "The Big Picture" in Chapter 6).

  5. Regardless of whether MODULE is defined, when there is no support for Hotplug in the kernel, devices cannot be removed from a running system. Therefore, the remove function will never be invoked by the PCI layer and can be initialized to a NULL pointer. This is indicated by the _ _devexit_p macro.

  6. When there is no support for Hotplug or for modules in the kernel, the driver's routine that would be used to initialize pci_driver->remove is not needed by the module. This is indicated by the _ _devexit macro. Note that this is not true when there is support for modules. Because a user is allowed to load and unload a module, the kernel needs the remove routine.

Note that point 5 is a consequence of point 6: if you do not include a routine in the kernel, you cannot refer to it (i.e., you cannot initialize a function pointer to that routine).[*]
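
The _ _devexit_p trick can be sketched in a few lines; the #ifdef below mirrors the macro's actual spirit in include/linux/init.h, while the surrounding driver structure is a hypothetical stand-in for pci_driver:

```c
#include <stddef.h>

/* With Hotplug support the remove routine is real; without it, the
 * pointer collapses to NULL, the PCI layer has nothing to call, and
 * the routine itself (tagged __devexit) can be discarded. */
#ifdef CONFIG_HOTPLUG
#define __devexit_p(fn) fn
#else
#define __devexit_p(fn) NULL
#endif

struct toy_pci_driver {              /* stand-in for struct pci_driver */
    void (*remove)(void);
};

static void sample_remove(void) { }

static struct toy_pci_driver drv = {
    .remove = __devexit_p(sample_remove),
};
```

Without CONFIG_HOTPLUG defined, drv.remove is NULL, so sample_remove is never referenced and the linker is free to drop it.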

Tuning via /proc Filesystem

There is no file of interest in /proc as far as this chapter is concerned.

Functions and Variables Featured in This Chapter

Table 7-4 summarizes the functions, macros, structures, and variables introduced in the chapter.

Table 7-4. Functions, macros, variables, and data structures introduced in this chapter

Name

Description

Functions and macros

 

_ _init, _ _exit, _ _initcall, _ _exitcall, _ _initdata, _ _exitdata, _ _devinit, _ _devexit, _ _devexit_p, _ _devinitdata, _ _devexitdata, xxx_initcall

Macros used to tag pieces of code with special characteristics. These tags can be used to optimize the kernel image size, leaving out unneeded code, for instance.

do_initcalls

Executes at boot time all the functions tagged with the xxx _initcall macros.

init_module, cleanup_module, module_init, module_exit

The first two are the names of the functions that each module should provide to respectively initialize and remove a module. The other two are macros that allow device driver writers to use an arbitrary name for the previous two routines.

netdev_boot_setup_check, netdev_boot_setup_add

Apply the boot-time configuration (if any) to a specific device.

module_param

Defines optional module parameters that can be provided when loading the module.

Data structures

 

kernel_param

Stores the input to the module_param macro.

obs_kernel_param

Stores the input to the _ _setup macro.

netdev_boot_setup, ifmap

netdev_boot_setup stores boot-time parameters for the ether= and netdev= options.

ifmap is one of the fields of netdev_boot_setup.

Variables

 

dev_boot_setup

Array of netdev_boot_setup structures.

NETDEV_BOOT_SETUP_MAX

Size of dev_boot_setup.

Files and Directories Featured in This Chapter

Figure 7-5 lists the files and directories referred to in the chapter.

Figure 7-5. Files and directories featured in this chapter




[*] You can find some documentation and examples of the use of boot options in the Linux BootPrompt HOWTO.

[*] Note that the use of these macros does not eliminate completely the use of conditional directives. The kernel still uses conditional directives to set off options that the user can configure when compiling the kernel.

[*] User-Mode Linux is the only architecture that actually makes use of shutdown routines. It does not use _ _exitcall macros, but defines its own macro, _ _uml_exitcall. The home page of the User-Mode Linux project is http://user-mode-linux.sourceforge.net.

[*] This is the memory that boot-time messages of the following sort refer to: "Freeing unused kernel memory: 120k freed".

[*] See the snapshot in the section "Example of PCI NIC Driver Registration" in Chapter 6.

Chapter 8. Device Registration and Initialization

In Chapters 5 and 6, we saw how NICs are recognized by the kernel, and the initialization that the kernel performs so that the NICs can talk to their device drivers. In this chapter, we will discuss additional stages of initialization:

  • When and how network devices register with the kernel

  • How a network device registers with the network device database and gets assigned an instance of a net_device structure

  • How net_device structures are organized into hash tables and lists to allow different kinds of lookups

  • How net_device instances are initialized, partly by kernel core routines and partly by their device drivers

  • How virtual devices differ from real ones with regard to registration

This chapter does not strive to be a guide on how to write NIC device drivers. I sometimes go into detail on an NIC device driver's code, but I will not cover the entire design of an NIC device driver. We are interested here only in registration and in the interface between device drivers and features such as link state change detection and power management. Refer to Linux Device Drivers (O'Reilly) for a detailed discussion of device drivers.

Before an NIC can be used, its associated net_device data structure must be initialized, added to the kernel network device database, configured, and enabled. It is important not to confuse registration and unregistration with enabling and disabling. They are two different concepts:

  • Registration and unregistration, if we exclude the act of loading a device driver module, are user independent; the kernel drives them. A device that has been only registered is not operative yet. We will see when a device is registered and unregistered in the sections "When a Device Is Registered" and "When a Device Is Unregistered."

  • Enabling and disabling a device require user intervention. Once a device has been registered by the kernel, the user can see it by means of user commands, configure it, and enable it. See the later section "Enabling and Disabling a Network Device."

Let's start by seeing what events trigger the registration and unregistration of network devices.

When a Device Is Registered

The registration of a network device takes place in the following situations:

Loading an NIC's device driver

An NIC's device driver is initialized at boot time if it is built into the kernel, and at runtime if it is loaded as a module. Whenever initialization occurs, all the NICs controlled by that driver are registered.

插入热插拔网络设备
Inserting a hot-pluggable network device

When a user inserts a hot-pluggable NIC, the kernel notifies its driver, which then registers the device. (For the sake of simplicity, we'll assume the device driver is already loaded.)

In the first situation, the registration model that applies is described in the later section "Skeleton of NIC Registration and Unregistration." It applies to all bus types, and is the same whether the registration routine ends up being called by the bus infrastructure or by the module initialization code. For example, we saw in Chapter 6 how loading a PCI device driver leads to the execution of the pci_driver->probe routine, usually named something like xxx _probe, which is provided by the driver and which takes care of device registration. In this chapter, we will look at how those probe routines are implemented.

The registration of devices using other bus types (USB, PCMCIA, etc.) shares the same skeleton. We will not look at how the infrastructure of those buses ends up calling their probe counterpart, as we saw for PCI in Chapter 6. Older buses may not be able to automatically detect the presence of devices and may require the device drivers to do it by manually probing specific memory addresses, using default parameters or boot-time parameters provided by the user.[*] We will not look at this case either.

When a Device Is Unregistered

Two main conditions trigger the unregistration of a device:

Unloading an NIC device driver

This can be done only for drivers loaded as modules, of course, not for those built into the kernel. When the administrator unloads an NIC's device driver, all the associated NICs must be unregistered.

For example, we saw in Chapter 6 how unloading a PCI device driver leads to the execution of the pci_driver->remove routine provided by the driver, often called something like xxx _remove_one, which will take care of device unregistration. This routine is invoked by the PCI layer once for each device registered against the driver being unloaded. In this chapter, we will look at how those routines are implemented.

Removing a hot-pluggable network device

When a user removes a hot-pluggable NIC from a system whose running kernel has support for hot-pluggable devices, the network device is unregistered.

Allocating net_device Structures

Network devices are defined with net_device structures. Because they are usually named dev in the kernel code, I use that name frequently in this chapter for a net_device. These data structures are allocated with alloc_netdev, defined in net/core/dev.c, which requires three input parameters:

Size of private data structure

We will see in the section "Organization of net_device Structures" that the net_device data structure can be extended by device drivers with a private data block to store the driver's parameters. This parameter specifies the size of the block.

Device name

This may be a partial name that the kernel will complete through some scheme that ensures unique device names.

Setup routine

This routine is used to initialize a portion of the net_device's fields. See the sections "Device Initialization" and "Device Type Initialization: xxx_setup Functions" for more details.

The return value is a pointer to the net_device structure allocated, or NULL in case of errors.

Every device is assigned a name that depends on the device type and that, to be unique, contains a number that is assigned sequentially as devices of the same type are registered. Ethernet devices, for instance, are called eth0, eth1, and so on. A single device may be called with different names depending on the order with which the devices are registered. For instance, if you had two cards handled by two different modules, the names of the devices would depend on the order in which the two modules were loaded. Hot-pluggable devices lend themselves particularly to unanticipated name changes.

Because user-space configuration tools refer to the kernel-assigned device name, the order with which devices register is important. As this is a user-space detail, I will not bother with it further, except to mention that there are tools, such as nameif from the net-tools package, that allow you to assign fixed names to interfaces based on the MAC address.

When the name of the device passed to alloc_netdev is in the form name %d (e.g., "eth%d"), the kernel completes the name using the function dev_alloc_name. The latter changes %d to the first unassigned number for that device type.
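
The completion logic can be sketched as follows. This is a simplified userspace model with a hypothetical list of taken names; the real dev_alloc_name in net/core/dev.c walks the registered-device list and tracks used unit numbers:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical names already present in the device database. */
static const char *taken[] = { "eth0", "eth1", "lo" };

/* Expand a "name%d" template with the first unit number not already
 * taken, mimicking what dev_alloc_name does for names like "eth%d".
 * Returns the unit number chosen, or -1 on failure. */
static int sketch_alloc_name(const char *fmt, char *buf, size_t len)
{
    for (int unit = 0; unit < 100; unit++) {
        snprintf(buf, len, fmt, unit);
        int used = 0;
        for (size_t i = 0; i < sizeof(taken) / sizeof(taken[0]); i++)
            if (strcmp(buf, taken[i]) == 0)
                used = 1;
        if (!used)
            return unit;
    }
    return -1;
}
```

With the list above, passing "eth%d" skips eth0 and eth1 and settles on eth2.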

The kernel also provides a set of wrappers around alloc_netdev, a few of which are listed in Table 8-1, which can be used to feed alloc_netdev the correct parameters for a set of common device types.[*] For example, alloc_etherdev is used for Ethernet devices, and therefore creates a device name in the form of the string eth followed by a unique number. As its second argument, it specifies ether_setup as the setup routine, which initializes a portion of the net_device structure to values common to all Ethernet devices.

Table 8-1. Wrappers for the alloc_netdev function

Network device type

Wrapper name

Wrapper definition

Ethernet

alloc_etherdev

return alloc_netdev(sizeof_priv, "eth%d", ether_setup);

Fiber Distributed Data Interface

alloc_fddidev

return alloc_netdev(sizeof_priv, "fddi%d", fddi_setup);

High Performance Parallel Interface

alloc_hippi_dev

return alloc_netdev(sizeof_priv, "hip%d", hippi_setup);

Token Ring

alloc_trdev

return alloc_netdev(sizeof_priv, "tr%d", tr_setup);

Fibre Channel

alloc_fcdev

return alloc_netdev(sizeof_priv, "fc%d", fc_setup);

Infrared Data Association

alloc_irdadev

return alloc_netdev(sizeof_priv, "irda%d", irda_device_setup);

Skeleton of NIC Registration and Unregistration

Figure 8-1(a) shows the generic scheme for an NIC's device driver to register with the networking code. Figure 8-1(b) shows the complementary action that takes place for unregistration. Although the example shows a PCI Ethernet NIC, the scheme is the same for other device types; only the name of the routine that takes care of it, or the way that routine is invoked, may change depending on how the bus code is implemented.

Figure 8-1. (a) Device registration model; (b) device unregistration model

The function starts by allocating the net_device structure with alloc_etherdev. alloc_etherdev also initializes all the parameters that are common to all Ethernet devices. The driver then initializes another portion of the net_device structure, and concludes the device registration with a call to the register_netdev routine.

Note that:

  • The driver calls the appropriate wrapper around alloc_netdev (alloc_etherdev in the example), and provides only the size of its private data block. A few wrappers are listed in Table 8-1.

  • The wrapper calls alloc_netdev using the parameter provided by the driver, and adds the other two (the device name and the initialization routine).

  • The size of the memory block allocated by alloc_netdev includes the net_device structure, the driver's private block, and some padding to force an alignment. See Figure 8-2 later in the chapter.

  • Some drivers call netdev_boot_setup_check to check whether the user provided any boot-time parameter when loading the kernel. See the section "Use of Boot Options to Configure Network Devices" in Chapter 7.

  • The new net_device instance is inserted into the device database with register_netdevice (see the later section "Device Registration"). Incidentally, I use the term database here, and in other parts of the book, to refer loosely to a combination of data structures that provides convenient access to information on the terms the kernel needs.
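
The single-allocation layout mentioned above can be sketched in userspace. The toy structure below is hypothetical; the real alloc_netdev in net/core/dev.c similarly pads the start of the driver's private area to an alignment boundary:

```c
#include <stdlib.h>
#include <stdint.h>

#define TOY_NETDEV_ALIGN 32          /* illustrative alignment boundary */

struct toy_net_device {              /* stand-in for struct net_device */
    char name[16];
    void *priv;                      /* driver's private block, same allocation */
};

/* One allocation covers the structure, the alignment padding, and the
 * driver's private block of sizeof_priv bytes. */
static struct toy_net_device *toy_alloc_netdev(size_t sizeof_priv)
{
    size_t off = (sizeof(struct toy_net_device) + TOY_NETDEV_ALIGN - 1)
                 & ~(size_t)(TOY_NETDEV_ALIGN - 1);
    struct toy_net_device *dev = calloc(1, off + sizeof_priv);

    if (dev && sizeof_priv)
        dev->priv = (char *)dev + off;   /* private data follows the padding */
    return dev;
}
```

Freeing the single block releases both the structure and the private data, which is what makes the free_netdev counterpart so simple.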

The unregistration of a device, shown in its simple form in Figure 8-1(b), always includes a call to unregister_netdevice and free_netdev. The call to free_netdev is sometimes made explicitly, and sometimes indirectly via the dev->destructor function,[*] as shown later in Figure 8-4. The device driver also needs to release any resources used by the device (IRQ, memory mappings, etc.), but we are not interested in those details in this chapter.

Device Initialization

In the section "When a Device Is Registered," we saw what needs to be initialized for the kernel to communicate to the NIC. In the rest of this chapter we will look at higher-level initialization tasks.

The net_device structure is pretty big. Its fields are initialized in chunks by different routines, each one responsible for a different subset of fields.[] In particular:

Device drivers

Parameters such as IRQ, I/O memory, and I/O port, whose values depend on the hardware configuration, are taken care of by the device driver. See Chapter 5.

Device type

The initialization of fields common to all the devices of a device type family is taken care of by the xxx _setup routines. For example, Ethernet devices use ether_setup. See the section "Device Type Initialization: xxx_setup Functions."

Features

Mandatory and optional features also need to be initialized. For example, the queuing discipline (i.e., QoS) is initialized in register_netdevice, as described in the section "register_netdevice Function." Other features can be initialized when the associated modules are notified about the registration of the new device (see the section "Device Registration Status Notification").

The device type initialization is done as part of the device driver initialization (that is, xxx _setup is called by xxx _probe) so that the driver has a chance to overwrite the default device type's initializations. See the section "Optional Initializations and Special Cases" for an example.

Table 8-2 shows the function pointers that are initialized by the xxx _setup routines and what is left to the device driver[*] (xxx _probe): what is device-type specific and what is device-model specific. Note that not all device drivers respect the distinction in Table 8-2. For instance, there are cases where the xxx _setup function does not initialize any function pointer (an example is irda_device_setup in net/irda/irda_device.c) and others where it initializes all of them (an example is wifi_setup in drivers/net/wireless/airo.c).

Table 8-2. net_device function pointers initialized by xxx_setup and xxx_probe

Initializer

Function pointer name

xxx _setup

change_mtu
set_mac_address
rebuild_header
hard_header
hard_header_cache
header_cache_update
hard_header_parse

Device driver's probe routine

open
stop
hard_start_xmit
tx_timeout
watchdog_timeo
get_stats
get_wireless_stats
set_multicast_list
do_ioctl
init
uninit
poll
ethtool_ops (this is actually an array of routines)

Table 8-3 is similar to Table 8-2, but instead of function pointers it lists some of the other net_device fields.

Table 8-3. net_device fields initialized by xxx_setup and xxx_probe

Initializer

Variable name

xxx_setup

type
hard_header_len
mtu
addr_len
tx_queue_len
broadcast
flags

Device driver's probe routine

base_addr
irq
if_port
priv
features

For more details on the meaning of the fields in Tables 8-2 and 8-3, refer to Chapter 2.

Device Driver Initializations

The net_device fields initialized by the device driver are usually taken care of by the xxx_ probe function introduced in the section "The Big Picture" in Chapter 6, and depicted in Figure 8-1(a).

Some drivers can handle different device models; so the same parameters can be initialized differently based on the device model and capabilities. The following snapshot, from the drivers/net/3c59x.c driver, shows that the function hard_start_xmit, which we will introduce in Chapter 11, is initialized differently depending on the device's capabilities:[*]

    if (vp->capabilities & CapBusMaster) {
        vp->full_bus_master_tx = 1;
            ... ... ...
    }
    ... ... ...
    if (vp->full_bus_master_tx) {
        dev->hard_start_xmit = boomerang_start_xmit;
            ... ... ...
    } else {
        dev->hard_start_xmit = vortex_start_xmit;
    }

Device Type Initialization: xxx_setup Functions

For the most common network device types there is an xxx _setup function to initialize the fields of the net_device structure (both parameters and function pointers) that are common to all the devices of the same type—for instance, all Ethernet cards.

表 8-1中,您看到了各个函数如何将正确的 例程传递给(作为第三个输入参数)。下面是例程,这是 以太网设备使用的例程:alloc_ xxx devxxx _setupalloc_netdevether_setupxxx _setup

In Table 8-1, you saw how the various alloc_ xxx dev functions pass the right xxx _setup routine to alloc_netdev (as the third input parameter). Here is the ether_setup routine, which is the xxx _setup routine used by Ethernet devices:

void ether_setup(struct net_device *dev)
{
    dev->change_mtu           = eth_change_mtu;
    dev->hard_header          = eth_header;
    dev->rebuild_header       = eth_rebuild_header;
    dev->set_mac_address      = eth_mac_addr;
    dev->hard_header_cache    = eth_header_cache;
    dev->header_cache_update  = eth_header_cache_update;
    dev->hard_header_parse    = eth_header_parse;

    dev->type                 = ARPHRD_ETHER;
    dev->hard_header_len      = ETH_HLEN;
    dev->mtu                  = 1500;
    dev->addr_len             = ETH_ALEN;
    dev->tx_queue_len         = 1000;
    dev->flags                = IFF_BROADCAST|IFF_MULTICAST;

    memset(dev->broadcast,0xFF, ETH_ALEN);
}

As you can see, this function initializes only the fields and function pointers that can be shared by any Ethernet card: an MTU of 1,500, a link-layer broadcast address of FF:FF:FF:FF:FF:FF, an egress queue length of 1,000 packets,[*] etc.

The use of a generic allocation wrapper and the xxx_setup routine, as shown in Table 8-1, is the most common approach. However:

  • Some classes of devices define setup functions but do not provide a generic wrapper similar to the ones in Table 8-1. Among them are ARCNET[] devices (see arcdev_setup in drivers/net/arcnet/arcnet.c) and IrDA[] devices (see irda_device_setup in net/irda/irda_device.c).

  • A generic xxx_setup may be used by devices that do not belong to the indicated class. ether_setup is an example: it is used by non-Ethernet devices as well. When most of the initializations of a particular xxx_setup routine suit the needs of a device driver, the latter may use that xxx_setup routine and simply override those initializations that are not correct. But this approach is not common.

  • An Ethernet driver can use the default initialization provided by ether_setup (which is invoked indirectly by alloc_etherdev) but override some of the initializations. For example, the 3c59x.c driver does not use the net_device->mtu value set by ether_setup, but overrides it with a local variable. This variable is initialized to the same default that would be set by ether_setup, but the driver can set bigger values for NIC models that can handle them.

Optional Initializations and Special Cases

There are cases when some net_device parameters are not initialized simply because they are meaningless for that type of device; the associated function pointer or value is not initialized and therefore is left to NULL.

To avoid NULL pointer references, the kernel always makes sure that optional function pointers are initialized before invoking them,[*] as in the following example from register_netdevice:

if (dev->init && dev->init(dev) != 0) {
    ...
}

It is important to note that external factors could also change how and where the fields of Tables 8-2 and 8-3 are initialized. One example involves the net_device->mtu field. Virtual devices usually inherit configuration parameters from the real devices they are associated with, and then adjust them if needed. For example, virtual tunnel interfaces created by the IP-over-IP protocol inherit dev->mtu from the real devices they are associated with. (This is not automatic; the virtual device driver takes care of it.) However, due to the extra IP header needed by the IP-over-IP protocol, the MTU needs to be lowered accordingly (see ipip_tunnel_xmit in net/ipv4/ipip.c, which assumes an underlying Ethernet device).

Organization of net_device Structures

Some of the subtler aspects of the net_device structure include the following:

  • We saw in the section "Allocating net_device Structures" that when alloc_netdev is called to allocate a net_device structure, it is passed the size of the driver's private data block (whose size depends on the driver—some do not even use private data at all). alloc_netdev appends the private data to the net_device structure. Figure 8-1 showed how that parameter is passed and Figure 8-2 shows the effect on the memory allocation.

  • Figure 8-2 also shows the relationship between the net_device data structure and the optional driver's private data structure. Normally, the second part is allocated together with the first one so that a single kmalloc is sufficient, but there are also cases where the driver prefers to allocate its private block by itself (see driver C in Figure 8-2).

  • As shown in the example in Figure 8-2, the size of the driver's private block and its content change not only from one device type to another (e.g., Token Ring versus Ethernet) but also among devices of the same type (e.g., two different Ethernet cards).

  • dev_base (introduced later in this section) and the next pointer in net_device point to the beginning of the net_device structure, not to the beginning of the allocated block. However, the size of the initial padding is saved in dev->padded, which allows the kernel to release the whole memory block when it is time to do so.

net_device data structures are inserted both in a global list, as shown in Figure 8-2, and in two hash tables, as shown in Figure 8-3. These different structures allow the kernel to easily browse or look up the net_device database as required. Here are the details:

dev_base

This global list of all net_device instances allows the kernel to easily browse devices in case, for instance, it has to get some statistics, change a configuration across all devices as a consequence of a user command, or find devices matching given search criteria.

Because each driver has its own definition for the private data structure, the global list of net_device structures may link together elements of different sizes (see Figure 8-2).

Figure 8-2. Global list of registered devices

dev_name_head

This is a hash table indexed on the device name. It is useful, for instance, when applying a configuration change via the ioctl interface. The old-generation configuration tools that talk to the kernel via the ioctl interface usually refer to devices by their names.

dev_index_head

This is a hash table indexed on the device ID dev->ifindex. Cross-references to net_device structures usually store either device IDs or pointers to net_device structures; dev_index_head is useful for the former. Also, the new-generation configuration tool ip (from the IPROUTE2 package), which talks to the kernel via the Netlink socket, usually refers to devices by their ID.

Figure 8-3. Hash tables used to search net_device instances based on device name and device index

Lookups

The most common lookups are based either on the device name or on the device ID. These two lookup types are implemented by dev_get_by_name and dev_get_by_index, which use the two hash tables discussed in the previous section. It is also possible to search net_device instances based on their device type, MAC address, etc. These kinds of lookups use the dev_base list.

All lookups, both on the dev_base list and on the two hash tables, are protected by the dev_base_lock lock.

All lookup routines are defined in net/core/dev.c.

Device State

net_device结构包含定义当前状态的不同字段装置。这些包括:

The net_device structure includes different fields that define the current state of the device. These include:

flags

Bitmap used to store different flags. Most of them represent a device's capabilities. However, one of them, IFF_UP, is used to say whether the device is enabled (up) or disabled (down). You can find the list of IFF_XXX flags in include/linux/if.h. See also the section "Enabling and Disabling a Network Device."

reg_state

Device registration state. The section "Registration State" lists the values this field can be assigned and when its value changes.

state

Device state with regard to its queuing discipline . See the section "Queuing Discipline State."

You may find a little bit of overlap sometimes between these variables. For example, every time IFF_UP is set in flags, __LINK_STATE_START is set in state, and vice versa. Both of them are set and cleared, respectively, by dev_open and dev_close. However, their domains are different, and a little bit of overlap may sometimes be introduced when writing modular code.

Queuing Discipline State

Each network device is assigned a queuing discipline, which is used by Traffic Control to implement its QoS mechanisms. The state field of net_device is one of the structure's fields used by Traffic Control. state is a bitmap, and the following list shows the flags that can be set. They are defined in include/linux/netdevice.h.

__LINK_STATE_START

The device is up. This flag can be checked with netif_running. See the section "Enabling and Disabling a Network Device."

__LINK_STATE_PRESENT

The device is present. This flag may look superfluous; but take into account that hot-pluggable devices can be temporarily removed. The flag is also cleared and restored, respectively, when the system goes into suspend mode and then resumes. The flag can be checked with netif_device_present. See the section "register_netdevice Function."

__LINK_STATE_NOCARRIER

There is no carrier. The flag can be checked with netif_carrier_ok. See the section "Link State Change Detection."

__LINK_STATE_LINKWATCH_EVENT

The device's link state has changed. See the section "Scheduling and processing link state change events."

__LINK_STATE_XOFF

__LINK_STATE_SHED

__LINK_STATE_RX_SCHED

These three flags are used by the code that manages ingress and egress traffic on the device. We will see how they are used in Part III.

Registration State

The state of a device with regard to its registration with the network stack is saved in the reg_state field of the net_device structure. The NETREG_XXX values it can take are defined in include/linux/netdevice.h, within the net_device structure definition. In the next section, we will see how they relate to each other. Here is a brief description:

NETREG_UNINITIALIZED

Defined as 0. When the net_device data structure is allocated and its contents zeroed, this value represents the 0 in dev->reg_state.

NETREG_REGISTERING

The net_device structure has been added to the structures listed in the later section "Organization of net_device Structures," but the kernel still needs to add an entry to the /sys filesystem.

NETREG_REGISTERED

The device has been fully registered.

NETREG_UNREGISTERING

net_device结构已从后面的“ net_device 结构的组织”部分列出的结构中删除。

The net_device structure has been removed from the structures listed in the later section "Organization of net_device Structures."

NETREG_UNREGISTERED

The device has been fully unregistered (which includes removing the entry from /sys), but the net_device structure has not been freed yet.

NETREG_RELEASED

All the references to the net_device structure have been released. The data structure can be freed, from the networking code's perspective. However, it will be up to sysfs to take care of it. See the section "Reference Counts."

Registering and Unregistering Devices

Network devices are registered and unregistered with the kernel with register_netdev and unregister_netdev, respectively. These are simple wrappers that take care of locking and then invoke the routines register_netdevice and unregister_netdevice, respectively. We already briefly introduced these functions in Figure 8-1. All of them are defined in net/core/dev.c.

Figure 8-4 shows the registration states a net_device can be set to, and shows where the aforementioned routines come into the picture. It also shows where other key routines are called. All of them will be described in later sections. In particular, note that:

  • Changes of state may use intermediate states between NETREG_UNINITIALIZED and NETREG_REGISTERED. These progressions are handled by netdev_run_todo, described in the section "Split Operations: netdev_run_todo."

  • The two net_device virtual functions init and uninit can be used by device drivers to initialize and clean up private data, respectively, when registering and unregistering a device. They are mainly used by virtual devices. See the section "Virtual Devices."

  • The unregistration of a device cannot be completed until all references to the associated net_device data structure have been released: netdev_wait_allrefs does not return until that condition is met. See the section "Reference Counts."

  • Both the registration and unregistration of a device are completed by netdev_run_todo. We will see in the section "Split Operations: netdev_run_todo" how register_netdevice and unregister_netdevice interact with netdev_run_todo.

Figure 8-4. net_device's registration state machine

Split Operations: netdev_run_todo

register_netdevice takes care of a portion of the registration, and then lets netdev_run_todo complete it. At first, it may not be clear how this happens by looking at the code. Let's see how it works with the help of Figure 8-4.

Changes to net_device structures are protected with the Routing Netlink semaphore via rtnl_lock and rtnl_unlock, which is why register_netdev acquires the lock (semaphore) at the beginning and releases it before returning (more details in the section "Locking"). Once register_netdevice is done with its job, it adds the new net_device structure to net_todo_list with net_set_todo. That list contains the devices whose registration (or unregistration, as we will see in a moment) has to be completed. The list is not processed by a separate kernel thread or by means of a periodic timer; it will be up to register_netdev to indirectly process it when releasing the lock.

Thus, rtnl_unlock not only releases the lock, but also calls netdev_run_todo. [*] The latter function browses the net_todo_list array and completes the registration of all its net_device instances.

Only one CPU can be running netdev_run_todo at any one time. Serialization is enforced with the net_todo_run_mutex mutex.

The unregistration of a device is handled exactly the same way (as shown in Figure 8-5(b)).

Figure 8-5. Structure of register_netdev and unregister_netdev

What netdev_run_todo does, exactly, to complete the registration or unregistration of a device is described at the end of the sections "register_netdevice Function" and "unregister_netdevice Function," respectively.

Note that since the registration and unregistration tasks handled by netdev_run_todo do not hold the lock, this function can safely sleep and leave the semaphore available. You will see one example why this is a good thing in the section "Reference Counts."

Given the model of Figure 8-5, it may seem that the kernel cannot have more than one net_device instance in net_todo_list by the time netdev_run_todo is called. How can there be more than one element if register_netdev and unregister_netdev add only one net_device instance to the list and then process the latter right away when releasing the lock? Well, for example, it is possible for a device driver to use a loop like the following to unregister all of its devices in one shot (see, for instance, tun_cleanup in drivers/net/tun.c):

rtnl_lock( );
loop for each device driven by this driver {
    ... ... ...
    unregister_netdevice(dev);
    ... ... ...
}
rtnl_unlock( );

This is better than the following approach, which gets and releases the lock and processes net_todo_list at each iteration of the loop:

loop for each device driven by this driver {
    ... ... ...
    unregister_netdev(dev);
    ... ... ...
}

Device Registration Status Notification

Both kernel components and user-space applications may be interested in knowing when a network device is registered, unregistered, goes down, or comes up. Notifications about these events are sent via two channels:

netdev_chain

Kernel components can register with this notification chain. See the following section, "netdev_chain notification chain."

Netlink's RTMGRP_LINK multicast group

User-space applications, such as monitoring tools or routing protocols, can register with RTnetlink's RTMGRP_LINK multicast group. See the section "RTnetlink link notifications."

netdev_chain notification chain

We saw what notification chains are and how they are used in Chapter 4. The progress through the various stages of registering and unregistering a device is reported with the netdev_chain notification chain. This chain is defined in net/core/dev.c, and kernel components interested in these kinds of events register and unregister with the chain with register_netdevice_notifier and unregister_netdevice_notifier, respectively.

All the NETDEV_XXX events that are reported via netdev_chain are listed in include/linux/notifier.h. Here are the ones we have seen in this chapter, together with the conditions that trigger them:

NETDEV_UP

NETDEV_GOING_DOWN

NETDEV_DOWN

NETDEV_UP is sent to report about a device that has been enabled, and is generated by dev_open.

NETDEV_GOING_DOWN is sent when the device is about to be disabled. NETDEV_DOWN is sent when the device has been disabled. They are both generated by dev_close.

For more details on these three events, see the section "Enabling and Disabling a Network Device."

NETDEV_REGISTER

The device has been registered. This event is generated by register_netdevice. See the section "register_netdevice Function."

NETDEV_UNREGISTER

The device has been unregistered. This event is generated by unregister_netdevice. See the section "unregister_netdevice Function."

And here are the other ones:

NETDEV_REBOOT

The device has restarted due to a hardware failure. Currently not used.

NETDEV_CHANGEADDR

The hardware address (or the associated broadcast address) of the device has changed.

NETDEV_CHANGENAME

The device has changed its name.

NETDEV_CHANGE

The device status or configuration of the device has changed. This is used in all the cases not covered by NETDEV_CHANGEADDR and NETDEV_CHANGENAME. It is currently used when something changes in dev->flags.

The NETDEV_CHANGEXXX notifications are usually generated in response to a user configuration change.

Note that register_netdevice_notifier, when registering with the chain, also replays (to the new registrant only) all the past NETDEV_REGISTER and NETDEV_UP notifications for the devices currently registered in the system. This gives the new registrant a clear picture of the current status of the registered devices.

Quite a few kernel components register to netdev_chain. Among them are:

Routing

For instance, the routing subsystem uses this notification to add or remove all the routing entries associated with the device. See Chapter 32.

Firewall

For example, if the firewall had buffered any packet from a device that now is down, it has to either drop the packet or take another action according to its policies.

Protocol code (i.e., ARP, IP, etc.)

For example, when you change the MAC address of a local device, the ARP table must be updated accordingly. See the associated protocol chapters for more details.

Virtual devices

See the section "Virtual Devices."

RTnetlink

See the following section, "RTnetlink link notifications."

RTnetlink link notifications

Notifications are sent to the Link multicast group RTMGRP_LINK with rtmsg_ifinfo when something changed in the device's state or configuration. Among these notifications are:

  • When a notification is received on the netdev_chain notification chain. RTnetlink registers to the netdev_chain chain introduced in the previous section and replays the notifications it receives.

  • When a disabled device is enabled or vice versa (see netdev_state_change).

  • When a flag in net_device->flags is changed, for example, via a user configuration command (see dev_change_flags).

netplugd is a daemon, part of the net-utils package, that listens to these notifications and reacts according to a user configuration file. See the netplugd manpage for details.

Device Registration

Device registration, whose basic model is shown in Figure 8-1(a), does not consist simply of inserting the net_device structure into the global list and hash tables introduced in the section "Organization of net_device Structures." It also involves the initialization of some parameters in the net_device structure, the generation of a broadcast notification that will inform other kernel components about the registration, and other tasks. Devices are registered with register_netdev, which is a simple wrapper around register_netdevice. The wrapper mainly takes care of locking and name completion as described earlier in the section "Allocating net_device Structures." The lock protects the dev_base list of registered devices.

register_netdevice Function

As described in Figure 8-5(a), register_netdevice starts device registration and calls net_set_todo, which ultimately asks netdev_run_todo to complete the registration.

Here are the main tasks carried out by register_netdevice:

  • Initialize some of the net_device's fields, including the ones used for locking, listed in the section "Locking."

  • When the kernel has support for the Divert feature, allocate a configuration block needed by the feature and link it to dev->divert. This is taken care of by alloc_divert_blk.

  • If the device driver had initialized dev->init, execute that function. See the section "Virtual Devices."

  • Assign the device a unique identifier with dev_new_index. The identifier is generated using a counter that is incremented every time a new device is added to the system. This counter is a 32-bit variable, so dev_new_index includes an if clause to handle wraparound as well as another if clause to handle the possibility that the variable hits a value that was already assigned.

  • Append net_device to the global list dev_base and insert it into the two hash tables described in the section "Organization of net_device Structures." Even though adding the structure at the head of dev_base would be faster, the kernel has a chance to check for duplicate device names by browsing the entire list. The device name is checked against invalid names with dev_valid_name.

  • Check the feature flags for invalid combinations. For example:

    • Scatter/Gather-DMA is useless without L4 hardware checksumming support and is therefore disabled in that situation.

    • TCP Segmentation Offload (TSO) requires Scatter/Gather-DMA, and is therefore disabled when the latter is not supported.

    See Chapter 19 for more details on L4 checksums.

  • Set the __LINK_STATE_PRESENT flag in dev->state to make the device available (visible and usable) to the system. The flag is cleared, for example, when a hot-pluggable device is unplugged, or when a system with support for power management goes into suspend mode. See the section "Queuing Discipline State."

    The initialization of this flag does not trigger any action; instead, its value is checked in well-defined cases to filter out illegal requests or to get the device state.

  • Initialize the device's queuing discipline , used by Traffic Control to implement QoS, with dev_init_scheduler. The queuing discipline defines how egress packets are queued to and dequeued from the egress queue, defines how many packets can be queued before starting to drop them, etc. See the section "Queuing Discipline Interface" in Chapter 11.

  • Notify all the subsystems interested in device registration via the netdev_chain notification chain. Notification chains are described in Chapter 4.
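
The unique-identifier step above can be sketched in a few lines. This is a hypothetical userspace model, not the kernel's dev_new_index: a 32-bit counter that wraps around and skips values already assigned, with a callback standing in for the kernel's walk over the device list.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of the dev_new_index logic: a 32-bit counter that
 * wraps around and skips indexes already in use. The in_use callback
 * stands in for the kernel's lookup over registered devices. */
static uint32_t ifindex_counter;

typedef bool (*index_in_use_fn)(uint32_t index);

uint32_t new_index(index_in_use_fn in_use)
{
    for (;;) {
        ifindex_counter++;
        if (ifindex_counter == 0)     /* wraparound: 0 is not a valid index */
            ifindex_counter = 1;
        if (!in_use(ifindex_counter)) /* skip values that were already assigned */
            return ifindex_counter;
    }
}
```

The two if clauses mirror the two cases mentioned above: the wraparound of the 32-bit variable, and a candidate value that is still held by an existing device.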

When netdev_run_todo is called to complete the registration, it just updates dev->reg_state and registers the device in the sysfs filesystem.

Aside from memory allocation problems, device registration can fail only if the device name is invalid or is a duplicate, or when dev->init fails for some reason.

Device Unregistration

To unregister a device, the kernel and the associated device driver need to undo all the operations that were executed during its registration, and more:

  • Disable the device with dev_close, described in the section "Enabling and Disabling a Network Device."

  • Release all the allocated resources (IRQ, I/O memory, I/O port, etc.)

  • Remove the net_device structure from the global list dev_base and the two hash tables introduced in the section "Organization of net_device Structures."

  • Once all the references to the structure have been released, free the net_device data structure, the driver's private data structure, and any other memory block linked to it (see Figure 8-2). The net_device structure is freed with free_netdev. When the kernel is compiled with support for sysfs, free_netdev lets it take care of freeing the structure.

  • Remove any file that may have been added to the /proc and /sys filesystems.

Note that whenever there is a dependency between devices, unregistering one of them may force the unregistration of all (or part) of the others. See the section "Virtual Devices" for an example.

Three function pointers in net_device (represented by a variable named dev) come into the picture when unregistering a device:

dev->stop

This function pointer is initialized by the device driver to one of its local routines. It is invoked by dev_stop when disabling a device (see the section "Enabling and Disabling a Network Device"). Common tasks handled here include stopping the egress queue with netif_stop_queue,[*] releasing hardware resources, stopping any timers used by the device driver, etc.

Virtual devices do not need to release any hardware resources, but they may need to take care of other, high-level issues. See the section "Virtual Devices."

dev->uninit

This function pointer is also initialized by the device driver to one of its local routines. Only a few, tunneling virtual devices currently initialize it; they point it to a routine that mainly takes care of reference counts.

dev->destructor

When used, this is normally initialized to free_netdev or to a wrapper around it. However, destructor is not commonly initialized; only a few virtual devices use it. Most device drivers call free_netdev directly after unregister_netdevice.

Figure 8-4 shows when and in what order these three routines are invoked.

unregister_netdevice Function

unregister_netdevice accepts one parameter, the pointer to the net_device structure it is to remove:

int unregister_netdevice(struct net_device *dev)

In Chapter 9, we will see in detail how the networking code uses software interrupts (softirqs) to handle packet transmission (net_tx_action) and reception (net_rx_action). You can look at those functions, for now, as the interface between device drivers and upper-layer protocols. Two calls to synchronize_net are used to synchronize unregister_netdevice with the receive engine (net_rx_action) so that it will not access old data after it has been updated by unregister_netdevice.

Other tasks taken care of by unregister_netdevice include:

  • If the device was not disabled, it has to be disabled first with dev_close (see the section "Enabling and Disabling a Network Device").

  • The net_device instance is then removed from the global list dev_base and the two hash tables introduced in the section "Organization of net_device Structures." Note that this is not sufficient to forbid kernel subsystems from using the device: they may still hold a pointer to the net_device data structure. This is why net_device uses a reference count to keep track of how many references are left to the structure (see the section "Reference Counts").

  • All the instances of queuing discipline associated with the device are destroyed with dev_shutdown.

  • A NETDEV_UNREGISTER notification is sent on the netdev_chain notification chain to let other kernel components know about it. See the section "Device Registration Status Notification."

  • User space has to be notified about the unregistration. For instance, in a system with two NICs that could be used to access the Internet, this notification could be used to start the secondary device. See the section "Device Registration Status Notification."

  • Any data block linked to the net_device structure is freed. For example, the multicast data dev->mc_list is removed with dev_mc_discard, the Divert block is removed with free_divert_blk, etc. The ones that are not explicitly removed in unregister_netdevice are supposed to be removed by the function handlers that process the notifications mentioned in the previous bullet.

  • Whatever was done by dev->init in register_netdevice is undone here with dev->uninit.

  • Features such as bonding allow you to group a set of devices together and treat them as a single virtual device with special characteristics. Among those devices, one is often elected master because it plays a special role within the group. For obvious reasons, the device being removed should release any reference to the master device: having dev->master non-NULL at this point would be a bug. If we stick to the bonding example, the dev->master reference is cleared thanks to the NETDEV_UNREGISTER notifications sent just a few lines of code earlier.

Finally, net_set_todo is called to let net_run_todo complete the unregistration, as described in the section "Split Operations: netdev_run_todo," and the reference count is decreased with dev_put. net_run_todo unregisters the device from sysfs, changes dev->reg_state to NETREG_UNREGISTERED, waits until all the references are gone, and completes the unregistration with a call to dev->destructor.

Reference Counts

A net_device structure cannot be freed until all the references to it have been released. The reference count for the structure is kept in dev->refcnt, which is updated every time a reference is added or removed with dev_hold and dev_put, respectively.

When a device is registered with register_netdevice, dev->refcnt is initialized to 1. This first reference is therefore kept by the kernel code that is responsible for the network devices database. This reference will be released only with a call to unregister_netdevice. This means that dev->refcnt will never drop to zero until the device is to be unregistered. Therefore, unlike other kernel objects that are freed by the xxx _put routine when the reference count drops to zero, net_device data structures are not freed until you unregister the device from the kernel. We saw already the conditions that lead to the unregistration of a device in the section "When a Device Is Unregistered."

In summary, the call to dev_put at the end of unregister_netdevice is not sufficient to make a net_device instance eligible for deletion: the kernel still needs to wait until all the references are released. But because the device is no longer usable after it is unregistered, the kernel needs to notify all the reference holders so that they can release their references. This is done by sending a NETDEV_UNREGISTER notification to the netdev_chain notification chain. This also means that reference holders should register to the notification chain; otherwise, they will not be able to receive such notifications and take action accordingly.
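
The lifetime rule described above can be modeled in userspace. This is a minimal sketch, not the kernel implementation: the names mirror dev_hold/dev_put, but the struct and helpers are invented for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified model of net_device reference counting: registration takes
 * the first reference, so the count cannot reach zero while the device
 * database still knows the device. */
struct fake_net_device {
    atomic_int refcnt;
};

void register_device(struct fake_net_device *dev)
{
    atomic_store(&dev->refcnt, 1);   /* reference held by the device database */
}

void dev_hold(struct fake_net_device *dev) { atomic_fetch_add(&dev->refcnt, 1); }
void dev_put(struct fake_net_device *dev)  { atomic_fetch_sub(&dev->refcnt, 1); }

/* the structure may be freed only once unregistration has dropped the
 * initial reference and every other holder has dropped its own */
bool eligible_for_free(struct fake_net_device *dev)
{
    return atomic_load(&dev->refcnt) == 0;
}
```

The point of the sketch is the ordering: the dev_put issued by unregistration removes only the database's reference, and the structure stays alive until the remaining holders react to the NETDEV_UNREGISTER notification.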

As we mentioned in the section "Split Operations: netdev_run_todo," unregister_netdevice starts the unregistration process and lets netdev_run_todo complete it. netdev_run_todo calls netdev_wait_allrefs to indefinitely wait until all references to the net_device structure have been released. The next section goes into detail on the internals of netdev_wait_allrefs.

Function netdev_wait_allrefs

netdev_wait_allrefs, depicted in Figure 8-6, consists of a loop that ends only when the value of dev->refcnt drops to zero. Every second it sends out a NETDEV_UNREGISTER notification, and every 10 seconds it prints a warning on the console. The rest of the time it sleeps. The function does not give up until all the references to the input net_device structure have been released.
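
The structure of that loop can be sketched as follows. This is an illustration under simplified assumptions, not the kernel routine: timing is simulated with an iteration counter instead of real one-second sleeps, and the notification is an injected callback.

```c
#include <stdio.h>

/* Illustrative model of the netdev_wait_allrefs loop: keep nudging
 * reference holders until the count drops to zero, warning every ten
 * "seconds". One loop iteration stands in for one second of sleeping. */
int wait_allrefs(int *refcnt, void (*resend_notification)(int *refcnt))
{
    int seconds = 0;
    while (*refcnt != 0) {
        resend_notification(refcnt);   /* NETDEV_UNREGISTER, once per second */
        if (seconds && seconds % 10 == 0)
            fprintf(stderr, "waiting for device to become free\n");
        seconds++;                     /* stands in for a one-second sleep */
    }
    return seconds;                    /* how long we waited */
}
```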

Two common cases that would require more than one notification to be sent are:

A bug

For example, a piece of code could hold references to net_device structures, but it may not release them because it has not registered to the netdev_chain notification chain, or because it does not process notifications correctly.

A pending timer

For example, suppose the routine that is executed when some timer expires needs to access data that includes references to net_device structures. In this case, you would need to wait until the timer expires and its handler hopefully releases its references.

Note that since netdev_run_todo is started by unregister_netdevice when it releases the lock, as described in the section "Split Operations: netdev_run_todo," it means that whoever started the unregistration, most probably the driver, is going to sleep waiting for netdev_run_todo to complete its job.

When the function sends the notification, it also processes the pending link state change events. Link state change events are covered in the section "Link State Change Detection." Here, suffice it to say that when a device is being unregistered, the kernel does not need to do anything when informed about a link state change event on that device: when the link state change event list is processed, events associated with devices being removed are treated as no-ops, so the list is cleared and only events for other devices are actually processed. This is simply an easy way to purge the link state change queue of events associated with a device that is about to disappear.

Enabling and Disabling a Network Device

Once a device has been registered it is available for use, but it will not transmit and receive traffic until it is explicitly enabled by the user (or a user-space application). Requests to enable a device are taken care of by dev_open, defined in net/core/dev.c. Enabling a device consists of the following tasks:

  • Call dev->open if it is defined. Not all device drivers initialize this function.

  • Set the _ _LINK_STATE_START flag in dev->state to mark the device as up and running.

    Figure 8-6. Function netdev_wait_allrefs

  • Set the IFF_UP flag in dev->flags to mark the device as up.

  • Call dev_activate to initialize the egress queuing discipline used by Traffic Control, and start the watchdog timer.[*] If there is no user configuration for Traffic Control, assign a default First In, First Out (FIFO) queue.

  • Send a NETDEV_UP notification to the netdev_chain notification chain to notify interested kernel components that the device is now enabled.

While a device needs to be explicitly enabled, it can be disabled either explicitly by a user command or implicitly by other events. For example, before a device is unregistered, it is first disabled (see the section "Device Unregistration"). Network devices are disabled with dev_close. Disabling a device consists of the following tasks:

  • Send a NETDEV_GOING_DOWN notification to the netdev_chain notification chain to notify interested kernel components that the device is about to be disabled.

  • Call dev_deactivate to disable the egress queuing discipline, thus making sure the device cannot be used for transmission anymore, and stop the watchdog timer because it is not needed anymore.

  • Clear the _ _LINK_STATE_START flag in dev->state to mark the device as down.

  • If a polling action was scheduled to read ingress packets on the device, wait for that action to complete. Because the _ _LINK_STATE_START flag has been cleared, no more receive polling will be scheduled on the device, but one could have been pending before the flag was cleared. See Chapter 10 for more detail on receive polling.

  • Call dev->stop if it is defined. Not all device drivers initialize this function.

  • Clear the IFF_UP flag in dev->flags to mark the device as down.

  • Send a NETDEV_DOWN notification to the netdev_chain notification chain to notify interested kernel components that the device is now disabled.
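
The flag bookkeeping performed by the two routines can be sketched as a toy model. The bit values and struct below are made up for the example; only the open/close ordering described above is the point, not the real dev_open/dev_close implementations.

```c
#include <stdbool.h>

/* Toy model of the state/flags handling in dev_open and dev_close. */
enum { LINK_STATE_START = 1 << 0 };   /* models __LINK_STATE_START in dev->state */
enum { IFF_UP_FLAG      = 1 << 0 };   /* models IFF_UP in dev->flags */

struct toy_net_device {
    unsigned int state;
    unsigned int flags;
};

void toy_dev_open(struct toy_net_device *dev)
{
    dev->state |= LINK_STATE_START;   /* mark the device up and running */
    dev->flags |= IFF_UP_FLAG;        /* mark the device administratively up */
}

void toy_dev_close(struct toy_net_device *dev)
{
    dev->state &= ~LINK_STATE_START;  /* stop scheduling work on the device first */
    dev->flags &= ~IFF_UP_FLAG;       /* then mark the device down */
}

bool toy_netif_running(const struct toy_net_device *dev)
{
    return dev->state & LINK_STATE_START;
}
```

Note how dev_close clears the running bit before the up flag, matching the order of the bullets above: once the running bit is gone, no new receive polling can be scheduled.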

Updating the Device Queuing Discipline State

We saw in the section "Queuing Discipline State" which flags can be set in dev->state to define the device queuing discipline state. In this section, we will see how two of those flags are used to handle power management and link state changes.

Interactions with Power Management

When the kernel has support for power management, NIC device drivers can be notified when the system goes into suspend mode, when it is resumed, etc. We saw in the section "Example of PCI NIC Driver Registration" in Chapter 6 how the suspend and resume function pointers of the pci_driver structures are initialized depending on whether the kernel has support for power management. This is, for example, how the drivers/net/3c59x.c device driver initializes its pci_driver instance:

static struct pci_driver vortex_driver = {
    .name        = "3c59x",
    .probe       = vortex_init_one,
    .remove      = __devexit_p(vortex_remove_one),
    .id_table    = vortex_pci_tbl,
#ifdef CONFIG_PM
    .suspend     = vortex_suspend,
    .resume      = vortex_resume,
#endif
};

When the system goes into suspend mode, the suspend routines provided by device drivers are executed to let drivers take action accordingly. Power management state changes do not affect the registration status dev->reg_state, but the device state dev->state needs to be changed.

Suspending a device

When a device is suspended, its device driver handles the event, by calling, for example, the pci_driver's suspend routine for PCI devices. Besides the driver-specific actions, a few additional actions must be performed by every device driver:

  • Clear the _ _LINK_STATE_PRESENT flag from dev->state because the device is temporarily not going to be operational.

  • If the device was enabled, disable its egress queue with netif_stop_queue [*] to prevent the device from being used to transmit any other packet. Note that a device that is registered is not necessarily enabled: when a device is recognized, it gets assigned to its device driver by the kernel and is registered; however, the device will not be enabled (and therefore usable) until an explicit user configuration requests it.

These tasks are succinctly implemented by netif_device_detach:

static inline void netif_device_detach(struct net_device *dev)
{
    if (test_and_clear_bit(__LINK_STATE_PRESENT, &dev->state) &&
        netif_running(dev)) {
        netif_stop_queue(dev);
    }
}

Resuming a device

When a device is resumed, its device driver handles the event, by calling, for example, the pci_driver's resume routine for PCI devices. Again, a few tasks are shared by all device drivers:

  • Set the _ _LINK_STATE_PRESENT flag in dev->state because the device is now available again.

  • If the device was enabled before being suspended, re-enable its egress queue with netif_wake_queue, and restart a watchdog timer used by Traffic Control (see the section "Watchdog timer" in Chapter 11).

These tasks are implemented by netif_device_attach:

static inline void netif_device_attach(struct net_device *dev)
{
    if (!test_and_set_bit(__LINK_STATE_PRESENT, &dev->state) &&
        netif_running(dev)) {
        netif_wake_queue(dev);
        __netdev_watchdog_up(dev);
    }
}

Link State Change Detection

When an NIC device driver detects the presence or absence of a carrier or signal, either because it was notified by the NIC or via an explicit check by reading a configuration register on the NIC, it can notify the kernel with netif_carrier_on and netif_carrier_off, respectively. These routines are to be called when there is a change in the carrier status; therefore, they do nothing when they are invoked inappropriately.

Here are a few common cases that may lead to a link state change:

  • A cable is plugged into or unplugged from an NIC.

  • The device at the other end of the cable is powered down or disabled. Examples of devices include hubs, bridges, routers, and PC NICs.

When netif_carrier_on is called by a device driver that has detected the carrier on one of its devices, the function:

  • Clears the _ _LINK_STATE_NOCARRIER flag from dev->state.

  • Generates a link state change event and submits it for processing with linkwatch_fire_event. See the section "Scheduling and processing link state change events."

  • If the device was enabled, starts a watchdog timer. The timer is used by Traffic Control to detect whether a transmission fails and gets stuck (in which case the timer times out). See the section "Watchdog timer" in Chapter 11.

    static inline void netif_carrier_on(struct net_device *dev)
    {
        if (test_and_clear_bit(__LINK_STATE_NOCARRIER, &dev->state))
            linkwatch_fire_event(dev);
        if (netif_running(dev))
            __netdev_watchdog_up(dev);
    }

When netif_carrier_off is called by a device driver that has detected the loss of the carrier on one of its devices, the function sets the _ _LINK_STATE_NOCARRIER flag in dev->state and generates a link state change event:

Note that both routines generate a link state change event and submit it for processing with linkwatch_fire_event, described in the next section.

static inline void netif_carrier_off(struct net_device *dev)
{
    if (!test_and_set_bit(__LINK_STATE_NOCARRIER, &dev->state))
        linkwatch_fire_event(dev);
}

Scheduling and processing link state change events

Link state change events are defined with lw_event structures. It's a pretty simple structure: it includes just a pointer to the associated net_device structure and another field used to link the structure to the global list of pending link state change events, lweventlist. The list is protected by the lweventlist_lock lock.

Note that the lw_event structure does not include any parameter to distinguish between detection and loss of carrier. This is because no differentiation is needed. All the kernel needs to know is that there was a change in the link status, so a reference to the device is sufficient. There will never be more than one lw_event instance in lweventlist for any device, because there's no reason to record a history or track changes: either the link is operational or it isn't, so the link state is either on or off. Two state changes equal no change, three changes equal one, etc., so new events are not queued when the device already has a pending link state change event. The condition can be detected by checking the _ _LINK_STATE_LINKWATCH_PENDING flag in dev->state, as shown in the flowchart in Figure 8-7.
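
The deduplication rule just described can be sketched as follows. Names and bit values are illustrative, not the kernel's: a "linkwatch pending" bit guards the queue so that a device never contributes more than one pending event.

```c
#include <stdbool.h>

/* Sketch of the lw_event deduplication: at most one pending event per
 * device, guarded by a pending bit in the device state. */
enum { LINKWATCH_PENDING = 1 << 3 };

struct lw_device {
    unsigned int state;
    int queued_events;     /* stands in for membership in lweventlist */
};

/* returns true when a new event was actually queued */
bool fire_link_event(struct lw_device *dev)
{
    if (dev->state & LINKWATCH_PENDING)
        return false;              /* one pending event is enough */
    dev->state |= LINKWATCH_PENDING;
    dev->queued_events++;
    return true;
}

void process_link_events(struct lw_device *dev)
{
    dev->state &= ~LINKWATCH_PENDING;  /* cleared when the list is processed */
    dev->queued_events = 0;
}
```

Because two state changes cancel out, dropping the duplicate loses no information: the consumer only needs to know that the link state may have changed.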

Figure 8-7. linkwatch_fire_event function

Once the lw_event data structure has been initialized with a reference to the right net_device instance and it has been added to the lweventlist list, and the _ _LINK_STATE_LINKWATCH_PENDING flag has been set in dev->state, linkwatch_fire_event needs to launch the routine that will actually process the elements on the lweventlist list. This routine, linkwatch_event, is not called directly. It is scheduled for execution by submitting a request to the keventd_wq kernel thread: a work_struct data structure is initialized with a reference to the linkwatch_event routine and is submitted to keventd_wq.

To avoid having the processing routine linkwatch_event run too often, its execution is rate limited to once per second.
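
A once-per-second gate of this kind can be modeled in a few lines. This is a sketch with an injected clock, not the linkwatch code: the kernel defers the work rather than dropping it, so this illustrates only the gating condition.

```c
/* Minimal model of once-per-second rate limiting, with time passed in
 * explicitly so the logic can be exercised without real sleeps. */
static long next_allowed_run;   /* "seconds"; 0 means never run yet */

int maybe_run(long now, void (*work)(void))
{
    if (now < next_allowed_run)
        return 0;               /* ran too recently: skip this request */
    next_allowed_run = now + 1; /* allow at most one run per second */
    work();
    return 1;
}
```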

linkwatch_event processes the elements of the lweventlist list with linkwatch_run_queue, under the protection of the rtnl lock described in the section "Locking." Processing lw_event instances consists simply of:

  • Clearing the _ _LINK_STATE_LINKWATCH_PENDING flag on dev->state.

  • NETDEV_CHANGEnetdev_chain通知链上发送通知

  • Sending a NETDEV_CHANGE notification on the netdev_chain notification chain

  • Sending an RTM_NEWLINK notification to the RTMGRP_LINK RTnetlink group. See the section "RTnetlink link notifications."

The two notifications are sent with netdev_state_change, but only when the device is enabled (dev->flags & IFF_UP): no one cares about link state changes on disabled devices.

Linkwatch flags

The code in net/core/linkwatch.c defines two flags that can be set in the global variable linkwatch_flags:

LW_RUNNING

When this flag is set, linkwatch_event has been scheduled for execution. The flag is cleared by linkwatch_event itself.

LW_SE_USED

Because lweventlist usually has at most one element, the code optimizes lw_event data structure allocations by statically allocating one and always using it as the first element of the list. Only when the kernel needs to keep track of more than one pending event (events on more than one device) does it allocate additional lw_event structures; otherwise, it simply recycles the same one.

Configuring Device-Related Information from User Space

Different tools can be used to configure or dump the current status of media and hardware parameters for network devices. Among them are:

  • ifconfig and mii-tool, from the net-tools package

  • ethtool, from the ethtool package

  • ip link, from the IPROUTE2 package

You can refer to the associated manpages for details on the syntax of those commands. The section "Ethtool" describes the interface between ethtool and the kernel, and the section "Media Independent Interface (MII)" describes the interface between mii-tool and the kernel. Later chapters return to the ifconfig and ip commands for the L3 configuration.

Figure 8-8 is a high-level overview of what we will cover in these sections. The figure does not show the locking details. Suffice it to say that both dev_ethtool and the call to dev->do_ioctl are protected with the routing Netlink lock (see the section "Locking").

以太网工具

Ethtool

本节概述 ethtool，及其与 mii-tool 以及 net_device 中 do_ioctl 函数指针的关系。

This section gives an overview of ethtool along with its relationship to mii-tool and the do_ioctl function pointer in net_device.

net_device 数据结构包括一个指向 ethtool_ops 类型 VFT 的指针。后一个结构是函数指针的集合，可用于读取和初始化 net_device 结构上的一批参数，或触发某个操作（即重新启动自动协商）。

The net_device data structure includes a pointer to a VFT of type ethtool_ops. The latter structure is a collection of function pointers that can be used to both read and initialize a bunch of parameters on the net_device structure, or to trigger an action (i.e., restart auto-negotiation).

当前并非所有设备驱动程序都支持此功能；支持它的驱动程序也并不总是支持其所有功能。dev->ethtool_ops 的初始化通常在本章开头介绍的 probe 例程中完成。

Not all device drivers currently support this feature; and those that do support it don't always support all of its functions. The initialization of dev->ethtool_ops is normally done in the probe routine introduced at the beginning of the chapter.

用户空间和这些函数之间的接口是旧的 ioctl 系统调用。图 8-8 显示了用户空间命令 ethtool 如何最终在内核端调用 dev_ethtool。该图还显示了 dev_ethtool 的框架，以及该函数如何与通用的媒体独立接口内核库对接。我们将在“媒体独立接口（MII）”部分讨论最后这一点。

The interface between user space and the functions is the old ioctl system call. Figure 8-8 shows how the user-space command ethtool ends up invoking dev_ethtool on the kernel side. The figure also shows the skeleton of dev_ethtool, and how this function interfaces to the generic Media Independent Interface Kernel library. We will address the last point in the section "Media Independent Interface (MII)."

无需详细介绍内核如何将ioctl命令分派给正确的处理程序,我只想说请求首先到达inet_ioctl,它调用 dev_ioctl,最终调用dev_ethtool。(你可以浏览一下代码,一步步看看它是如何工作的;代码非常清晰。)

Without going into too much detail on how the kernel dispatches ioctl commands to the right handlers, I'll just say that the request first arrives to inet_ioctl, which invokes dev_ioctl, which ends up calling dev_ethtool. (You can browse the code and see how it works step by step; the code is pretty clear.)

用于设备配置的 ioctl 接口

图 8-8。用于设备配置的 ioctl 接口

Figure 8-8. ioctl interface for device configuration

dev_ethtool 在持有路由 Netlink 锁的情况下运行（请参阅“锁定”部分）。该函数首先进行一些健全性检查。然后，根据通过 ifreq 数据结构从用户空间接收的命令类型，它调用正确的辅助例程 ethtool_ xxx，后者只是围绕 dev->ethtool_ops-> xxx 虚拟函数的简单包装。由于支持 Ethtool 的驱动程序不一定支持所有 ethtool_ops 函数，因此辅助例程可能会返回 -EOPNOTSUPP（不支持的操作）。图 8-9 中未显示这一点。

dev_ethtool runs with the routing Netlink lock held (see the section "Locking"). The function starts with a few sanity checks. Then, based on the command type received from user space via an ifreq data structure, it invokes the right helper routine ethtool_ xxx, which consists of a simple wrapper around a dev->ethtool_ops-> xxx virtual function. Because a driver that supports Ethtool does not necessarily support all the ethtool_ops functions, the helper routine can return -EOPNOTSUPP (operation not supported). This is not shown in Figure 8-9.

另请注意，dev_ethtool 分别在 ethtool_ xxx 支持例程执行之前和之后调用 ethtool_ops 的 begin 和 complete 函数。然而，这些函数是可选的，因此只有在设备驱动程序提供的情况下才会被调用。使用它们的驱动程序并不多，驱动程序也有可能只使用其中一个。某些 PCI NIC 设备驱动程序使用它们在发送命令之前为 NIC 加电（如果 NIC 已断电），之后再将其断电。

Note also that dev_ethtool calls the ethtool_ops functions begin and complete, respectively, before and after the execution of the ethtool_ xxx support routine. Those functions, however, are optional, and therefore are invoked only if provided by the device driver. Not many drivers use them, and it is also possible for a driver to use only one. Some PCI NIC device drivers use them to power up the NIC before sending it the command (if the NIC is powered down) and then to power it down again.

ethtool_ xxx 辅助例程的骨架非常简单：将数据从用户空间移动到内核空间（如果是“get”命令则相反），并调用其中一个 ethtool_ops 函数。

The skeleton of an ethtool_ xxx helper routine is pretty simple: move data from user space to kernel space (or vice versa, if it is a "get" command), and call one of the ethtool_ops functions.
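This dispatch logic can be illustrated with a small user-space sketch. The structures below are drastically simplified stand-ins (the real `ethtool_ops` has many more members, and the real helpers also copy data to and from user space), but they show how a helper falls back to `-EOPNOTSUPP` when a driver leaves a VFT entry `NULL`:

```c
#include <stddef.h>

#define EOPNOTSUPP 95  /* same value as on Linux; redefined for a standalone sketch */

struct net_device;

/* Drastically reduced stand-in for the kernel's ethtool_ops VFT. */
struct ethtool_ops {
    int (*get_link)(struct net_device *dev);
    int (*nway_reset)(struct net_device *dev);  /* restart auto-negotiation */
};

struct net_device {
    int link_up;
    const struct ethtool_ops *ethtool_ops;
};

/* Model of an ethtool_xxx helper: invoke the VFT entry if the driver
 * provides it, otherwise report "operation not supported". */
static int ethtool_get_link(struct net_device *dev)
{
    if (!dev->ethtool_ops || !dev->ethtool_ops->get_link)
        return -EOPNOTSUPP;
    return dev->ethtool_ops->get_link(dev);
}

static int ethtool_nway_reset(struct net_device *dev)
{
    if (!dev->ethtool_ops || !dev->ethtool_ops->nway_reset)
        return -EOPNOTSUPP;
    return dev->ethtool_ops->nway_reset(dev);
}

static int demo_get_link(struct net_device *dev)
{
    return dev->link_up;
}

/* A driver may fill in only part of the VFT; missing entries stay NULL. */
static const struct ethtool_ops demo_ops = {
    .get_link = demo_get_link,
    /* .nway_reset intentionally not provided */
};
```

With this layout, adding Ethtool support to a driver is mostly a matter of filling in the function pointers it can honor.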

不支持 ethtool 的驱动程序

Drivers that do not support ethtool

dev_ethtool调用它来处理其驱动程序不支持 Ethtool 的设备的命令时,它会尝试让驱动程序通过该dev->do_ioctl函数处理该命令。驱动程序也可能不支持后者。在这种情况下,dev_ethtool返回 - EOPNOTSUPP

When dev_ethtool is called to process a command for a device whose driver does not support Ethtool, it tries to let the driver process the command via the dev->do_ioctl function. It is possible that the driver does not support the latter either. In such a case, dev_ethtool returns -EOPNOTSUPP.

do_ioctl 也可以反过来调用 dev_ethtool（如图 8-8 中的虚线所示）：例如，虚拟设备就是这样做的，它们只是想让关联的真实设备的设备驱动程序来处理该命令（示例参见 net/8021q/vlan_dev.c 中的 vlan_dev_ioctl）。

It is also possible for do_ioctl to issue a call back to dev_ethtool (as shown with a dotted line in Figure 8-8): this is done, for instance, by virtual devices that simply want to let the device driver of the associated real device take care of the command (see vlan_dev_ioctl in net/8021q/vlan_dev.c for an example).

媒体独立接口 (MII)

Media Independent Interface (MII)

MII 是 IEEE 标准规范，描述网络控制器芯片和物理介质芯片之间的接口。通过此接口，用户可以（例如）启用、禁用和配置自动协商。并非所有 NIC 都有它。

MII is an IEEE standard specification that describes the interface between network controller chips and physical media chips. With this interface, the user can, for instance, enable, disable, and configure auto-negotiation. Not all NICs have it.

在 Linux 上与 MII 交互的最常用工具是 mii-tool。与 ethtool 一样，它通过 ioctl 与内核交互，如图 8-8 所示。内核提供了一组 ioctl 命令来处理 MII。这些命令主要包括对特定 NIC 寄存器的读写操作。

The most common tool used to interact with MII on Linux is mii-tool. Like ethtool, this interacts with the kernel via ioctl, as shown in Figure 8-8. The kernel provides a set of ioctl commands to handle MII. These commands consist mainly of read and write operations on specific NIC registers.

如图8-8所示, ioctl命令被传递给dev->do_ioctl设备驱动程序提供的函数。该函数可以通过以下两种方式之一处理它们:

As shown in Figure 8-8, the ioctl commands are passed to the dev->do_ioctl function provided by the device driver. The function can handle them in one of two ways:

  • 仅识别三个 MII ioctl 命令，并用设备驱动程序代码处理它们。这是最常见的情况。

  • Recognize only the three MII ioctl commands and process them with device driver code. This is the most common case.

  • 依赖内核 MII 库 drivers/net/mii.c，通过 generic_mii_ioctl 处理输入命令。

  • Rely on the kernel MII library drivers/net/mii.c by processing the input command with generic_mii_ioctl.

也可以（特别是对于虚拟设备）让 dev->do_ioctl 函数识别和处理除 MII 命令之外的其他命令。

It is also possible, especially for virtual devices , to have dev->do_ioctl functions that recognize and process other commands besides the MII ones.
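As a sketch of the first approach, the following user-space C fragment models a `do_ioctl`-style switch over the three MII commands. The command values match Linux's `<linux/sockios.h>`, and `mii_ioctl_data` is modeled on `<linux/mii.h>`, but `demo_do_ioctl` and its toy PHY registers are invented for illustration; a real driver would read and write NIC hardware here:

```c
#include <stdint.h>

/* The three MII ioctl commands (values as in Linux's <linux/sockios.h>). */
#define SIOCGMIIPHY 0x8947  /* get the PHY address in use */
#define SIOCGMIIREG 0x8948  /* read an MII register */
#define SIOCSMIIREG 0x8949  /* write an MII register */
#define EOPNOTSUPP  95

/* Simplified version of struct mii_ioctl_data from <linux/mii.h>. */
struct mii_ioctl_data {
    uint16_t phy_id;
    uint16_t reg_num;
    uint16_t val_in;
    uint16_t val_out;
};

/* Toy PHY: 32 registers behind a single PHY at address 1. */
static uint16_t phy_regs[32];

static int demo_do_ioctl(struct mii_ioctl_data *data, int cmd)
{
    switch (cmd) {
    case SIOCGMIIPHY:                  /* report which PHY we talk to */
        data->phy_id = 1;
        return 0;
    case SIOCGMIIREG:                  /* register read */
        data->val_out = phy_regs[data->reg_num & 0x1f];
        return 0;
    case SIOCSMIIREG:                  /* register write */
        phy_regs[data->reg_num & 0x1f] = data->val_in;
        return 0;
    default:
        return -EOPNOTSUPP;            /* anything else is unsupported */
    }
}
```

A driver following the second approach would collapse the three cases into a single call to `generic_mii_ioctl`, as the model below the next paragraph shows.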

对于那些依赖内核 MII 库且不实现特殊命令的驱动程序，以下是 dev->do_ioctl 函数的通用模型：

The following is a common model for the dev->do_ioctl function for those drivers that rely on the kernel MII library and do not implement special commands:

if (!netif_running(dev)) {
    return -EINVAL;
}
<锁定私有数据结构>
err = generic_mii_ioctl(...);
<解锁私有数据结构>
return err;
if (!netif_running(dev)) {
    return -EINVAL;
}
<lock private data structure>
err = generic_mii_ioctl(...);
<unlock private data structure>
return err;

请注意,在图 8-8中, ethtool命令可能最终会调用 MII 内核库中的例程(例如,重新启动自动协商)。

Note in Figure 8-8 that an ethtool command may end up invoking a routine from the MII kernel library (for example, to restart auto-negotiation).

虚拟设备

Virtual Devices

在第 5 章的“虚拟设备”部分中,我们看到了虚拟设备与真实设备在初始化方面的不同之处。就注册而言,虚拟设备需要像真实设备一样注册并启用才能使用。然而,也存在差异:

In the section "Virtual Devices" in Chapter 5, we saw how virtual devices differ from real ones with regard to initialization. As far as registration is concerned, virtual devices need to be registered and enabled just like real ones, to be used. However, there are differences:

  • 虚拟设备有时会调用 register_netdevice 和 unregister_netdevice 而不是它们的包装器，并自行处理锁定。它们可能需要自行处理锁定，以便将锁保持得比真实设备稍长一些。采用这种方法时，锁也可能被滥用：通过让它保护（register_netdev 之外的）本可以用其他方式保护的其他代码段，锁会被持有得比所需时间更长。

  • Virtual devices sometimes call register_netdevice and unregister_netdevice rather than their wrappers, and take care of locking by themselves. They may need to handle locking to keep the lock for a little longer than a real device does. With this approach, the lock could also be misused and held longer than needed, by making it protect additional pieces of code (besides register_netdev) that could be protected in other ways.

  • 真实设备无法通过用户命令取消注册(即销毁);它们只能被禁用。真实设备在卸载其驱动程序时会取消注册(当然,当作为模块加载时)。相比之下,虚拟设备也可以通过用户命令来创建和取消注册。这是否可行取决于虚拟设备驱动程序的设计。

  • Real devices cannot be unregistered (i.e., destroyed) with user commands; they can only be disabled. Real devices are unregistered at the time their drivers are unloaded (when loaded as modules, of course). Virtual devices, in contrast, may be created and unregistered with user commands, too. Whether this is possible depends on the virtual device driver's design.

我们还在“register_netdevice 函数”和“设备注销”部分看到，虚拟设备与大多数真实设备不同，会使用 dev->init、dev->uninit 和 dev->destructor。因为大多数虚拟设备在真实设备之上实现某种或多或少复杂的逻辑，所以它们使用 dev->init 和 dev->uninit 来处理额外的初始化和清理。dev->destructor 通常被初始化为 free_netdev（如图 8-4 所示），这样驱动程序在注销后就不需要显式调用后一个函数。

We also saw in the sections "register_netdevice Function" and "Device Unregistration" that virtual devices, unlike most real ones, use dev->init, dev->uninit, and dev->destructor. Because most virtual devices implement some kind of more or less complex logic on top of real devices, they use dev->init and dev->uninit to take care of extra initialization and cleanup. dev->destructor is often initialized to free_netdev (as shown in Figure 8-4) so that the driver does not need to explicitly call the latter function after unregistration.

我们在“设备初始化”部分看到了 net_device 结构的初始化如何在设备驱动程序的 probe 例程和通用设置例程之间划分。由于虚拟设备没有 probe 例程，因此表 8-2 和表 8-3 中的分类不适用于它们。

We saw in the section "Device Initialization" how the initialization of net_device structures is split between the device driver's probe routine and generic setup routines. Because virtual devices do not have a probe routine, the classification in Tables 8-2 and 8-3 does not apply to them.

虚拟设备驱动程序会注册到“设备注册状态通知”部分中描述的 netdev_chain 通知链，因为大多数虚拟设备都是在真实设备之上定义的，对真实设备的更改也会影响虚拟设备。让我们看两个例子：

Virtual device drivers register to the netdev_chain notification chain described in the section "Device Registration Status Notification" because most virtual devices are defined on top of real devices, so changes to real devices affect virtual ones, too. Let's see two examples:

粘合
Bonding

绑定是一种虚拟设备,允许您捆绑一组接口并使它们看起来像一个接口。可以使用不同的算法在一组接口之间分配流量,其中之一是简单的循环算法。我们以图8-9(a)为例。当eth0关闭时,绑定接口 bond0需要知道这一情况,以便在实际设备之间分配流量时将其考虑在内。如果eth1也出现故障,则 bond0必须被禁用,因为不会留下任何可用的实际设备。

Bonding is a virtual device that allows you to bundle a set of interfaces and make them look like a single one. Traffic can be distributed between the set of interfaces using different algorithms, one of which is a simple round robin. Let's take the example in Figure 8-9(a). When eth0 goes down, the bonding interface bond0 needs to know about it to take it into account when distributing traffic between the real devices. In case eth1 went down too, bond0 would have to be disabled because there would not be any working real device left.
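The round-robin policy can be sketched in a few lines of user-space C. This is only a model of the idea, not the kernel bonding driver's code: each transmission is steered to the next slave whose link is up, and the bond becomes unusable when no slave is left:

```c
/* Illustrative model of round-robin slave selection in a bonding-style
 * virtual device. NSLAVES and the field names are invented for this
 * example; they do not mirror the kernel's bonding data structures. */
#define NSLAVES 2

struct bond {
    int link_up[NSLAVES];  /* link state of each enslaved device */
    int next;              /* round-robin cursor */
};

/* Return the index of the slave to transmit on, or -1 if every slave
 * is down (the bond itself must then be considered unusable). */
static int bond_pick_slave(struct bond *b)
{
    for (int tried = 0; tried < NSLAVES; tried++) {
        int i = b->next;
        b->next = (b->next + 1) % NSLAVES;
        if (b->link_up[i])
            return i;
    }
    return -1;
}
```

This is exactly why the bonding driver must learn, via the notification chain, that a real device went down: otherwise it would keep scheduling traffic onto a dead link.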

VLAN接口
VLAN interfaces

Linux 支持 802.1Q 协议并允许您定义虚拟 LAN (VLAN) 接口。考虑图 8-9(b)中的示例,其中用户在eth0上定义了两个 VLAN 接口。当eth0 关闭时,所有虚拟 (VLAN) 接口也必须关闭。

Linux supports the 802.1Q protocol and allows you to define Virtual LAN (VLAN) interfaces. Consider the example in Figure 8-9(b), where the user has defined two VLAN interfaces on eth0. When eth0 goes down, all virtual (VLAN) interfaces must go down, too.

a) 绑定接口 b) VLAN 接口

图 8-9。a) 绑定接口 b) VLAN 接口

Figure 8-9. a) Bonding interface b) VLAN interfaces

锁定

Locking

我们在“net_device 结构的组织”一节中看到，dev_base 列表和两个哈希表 dev_name_head 与 dev_index_head 受 dev_base_lock 锁的保护。然而，该锁仅用于序列化对列表和哈希表的访问，而不用于序列化对 net_device 数据结构内容的更改。net_device 内容的更改由路由 Netlink 信号量（rtnl_sem）负责，该信号量分别通过 rtnl_lock 和 rtnl_unlock 获取和释放。[ * ] 该信号量用于序列化来自以下来源的对 net_device 实例的更改：

We saw in the section "Organization of net_device Structures" that the dev_base list and the two hash tables dev_name_head and dev_index_head are protected by the dev_base_lock lock. That lock, however, is used only to serialize accesses to the list and tables, not to serialize changes to the contents of net_device data structures. net_device content changes are taken care of by the Routing Netlink semaphore (rtnl_sem), which is acquired and released with rtnl_lock and rtnl_unlock, respectively.[*] This semaphore is used to serialize changes to net_device instances from:

运行时事件
Runtime events

例如,当链路状态发生变化时(例如,插入或拔出网线),内核需要通过修改 来改变设备状态dev->flags

For example, when the link state changes (e.g., a network cable is plugged or unplugged), the kernel needs to change the device state by modifying dev->flags.

配置变更
Configuration changes

当用户使用 net-tools 包中的 ifconfig、route 或 IPROUTE2 包中的 ip 等命令应用配置更改时，内核将分别通过 ioctl 命令和 Netlink 套接字得到通知。通过这些接口调用的例程必须使用锁。

When the user applies a configuration change with commands such as ifconfig and route from the net-tools package, or ip from the IPROUTE2 package, the kernel is notified via ioctl commands and the Netlink socket, respectively. The routines invoked via these interfaces must use locks.

net_device数据结构包括一些用于锁定的字段,其中:

The net_device data structure includes a few fields used for locking, among them:

ingress_lock
ingress_lock

queue_lock
queue_lock

分别由流量控制在处理入口和出口流量调度时使用。

Used by Traffic Control when dealing with ingress and egress traffic scheduling, respectively.

xmit_lock
xmit_lock

xmit_lock_owner
xmit_lock_owner

用于同步对设备驱动程序hard_start_xmit函数的访问。

Used to synchronize accesses to the device driver hard_start_xmit function.

有关这些锁的更多详细信息,请参阅第 11 章

For more details on these locks, please refer to Chapter 11.

通过 /proc 文件系统进行调整

Tuning via /proc Filesystem

/proc中没有可用于调整设备注册和取消注册任务的文件。

There are no files in /proc that can be used to tune the device registration and unregistration tasks.

本章介绍的函数和变量

Functions and Variables Featured in This Chapter

表 8-4总结了本章介绍的函数、数据结构和变量。

Table 8-4 summarizes the functions, data structures, and variables introduced in this chapter.

表 8-4。本章介绍的函数、数据结构和变量

Table 8-4. Functions, data structures, and variables introduced in this chapter

名称

Name

描述

Description

函数

Functions

 

alloc_netdev，alloc_ xxx dev 包装器

alloc_netdev, alloc_ xxx dev wrappers

分配并部分初始化net_device结构。

Allocate and partially initialize a net_device structure.

free_netdev

free_netdev

释放一个net_device结构。

Frees a net_device structure.

dev_alloc_name

dev_alloc_name

完成设备名称。

Completes a device name.

register_netdevice,

register_netdevice,

register_netdev

register_netdev

unregister_netdevice,

unregister_netdevice,

unregister_netdev

unregister_netdev

注册和取消注册网络设备。xxx _netdev API 是 xxx _netdevice API 的包装器。

Register and unregister a network device. The xxx _netdev APIs are wrappers for the xxx _netdevice APIs.

xxx _setup

xxx _setup

用于初始化 net_device 结构一部分的辅助例程。每种最常见的接口类型都有一个。

Helper routines used to initialize part of the net_device structure. There is one for each of the most common interface types.

dev_hold

dev_hold

dev_put

dev_put

增加和减少 net_device 结构上的引用计数。

Increment and decrement the reference count on a net_device structure.

netif_carrier_on

netif_carrier_on

netif_carrier_off

netif_carrier_off

netif_carrier_ok

netif_carrier_ok

当设备上的载体被检测到、丢失或被读取时分别调用。

Called when the carrier on a device is detected, lost, or to be read, respectively.

netif_device_attach

netif_device_attach

netif_device_detach

netif_device_detach

分别在设备插入系统和从系统拔出时调用。当系统进入挂起模式然后恢复时也会调用。

Called when a device is plugged into and unplugged from the system, respectively. Called also when the system goes into suspend mode and then resumes.

netif_start_queue

netif_start_queue

netif_stop_queue

netif_stop_queue

netif_queue_stopped

netif_queue_stopped

分别调用以启动、停止和检查设备出口队列的状态。

Called to start, stop, and check the status of the device egress queue, respectively.

dev_ethtool

dev_ethtool

处理来自 ethtool 用户空间命令的 ioctl 命令。

Processes ioctl commands from the ethtool user-space command.

变量

Variables

 

dev_base

dev_base

dev_name_head

dev_name_head

dev_index_head

dev_index_head

dev_base_lock

dev_base_lock

dev_base 是已注册网络设备的平面列表。dev_ xxx _head 是两个 net_device 结构的哈希表，分别按设备名称和 ID 进行索引。前面三个结构都受 dev_base_lock 锁的保护。

dev_base is a flat list of registered network devices. dev_ xxx _head are two hash tables for net_device structures, indexed on the device's name and ID. The previous three structures are protected by the dev_base_lock lock.

lweventlist

lweventlist

lweventlist_lock

lweventlist_lock

lweventlist 是待处理 lw_event 事件的列表。该列表受 lweventlist_lock 保护。

lweventlist is a list of pending lw_event events. The list is protected by lweventlist_lock.

数据结构

Data structure

 

lw_event

lw_event

链路状态改变事件。

Link state change event.

本章介绍的文件和目录

Files and Directories Featured in This Chapter

图 8-10显示了本章提到的文件和目录在内核源代码树中的位置。

Figure 8-10 shows where the files and directories mentioned in this chapter are located in the kernel source tree.

本章介绍的文件和目录

图 8-10。本章介绍的文件和目录

Figure 8-10. Files and directories featured in this chapter




[ * ] 例如，参见 drivers/net/Space.c 中的 net_olddevs_init。这个函数带有第 7 章中介绍的 device_initcall 宏标记，在启动时执行。同一个函数还负责环回设备的注册。

[*] See, for example, net_olddevs_init in drivers/net/Space.c. This function, which is tagged with the device_initcall macro introduced in Chapter 7, is executed at boot time. The same function takes care of the registration of the loopback device.

[ * ] 还有其他类似的包装器不遵循 alloc_ xxx dev 命名约定。此外，某些设备直接调用 alloc_netdev 向内核注册，而不使用包装器。

[*] There are other, similar wrappers that do not follow the alloc_ xxx dev naming convention. Furthermore, some devices call alloc_netdev directly to register with the kernel instead of using a wrapper.

[ * ]只有少数虚拟设备的设备驱动程序使用此方法(例如,参见net/8021q/vlan.c)。图 8-4中的两个调用 是互斥的。

[*] The device drivers of only a few virtual devices use this approach (see, for example, net/8021q/vlan.c). The two calls in Figure 8-4 are mutually exclusive.

[ ]一个有趣的例外是环回设备,其初始化是硬编码在drivers/net/loopback.c 的loopback_dev定义中的。

[] An interesting exception is the loopback device, whose initialization is hardcoded in the loopback_dev definition in drivers/net/loopback.c.

[ * ] 第 2 章包含 net_device 数据结构所有参数的详细描述。

[*] Chapter 2 contains a detailed description of all the parameters of the net_device data structure.

[ * ]功能可以硬编码到驱动程序中或通过读取 NIC 上的寄存器来检索。

[*] Capabilities can be hardcoded into the driver or retrieved by reading a register on the NIC.

[ * ]这是Linux的实现选择;它不是源自任何协议规范。根据配置的出口排队规则,可能不会使用此值。

[*] This is Linux's implementation choice; it is not derived from any protocol specification. Depending on the egress queuing discipline configured, this value may not be used.

[ ] ARCNET(附加资源计算机)是一种基于令牌总线设计(类似于 802.4)的 LAN 技术,由于其确定性性能,它在工业自动化行业中找到了它的自然习惯。Linux 为 ARCNET 和一些设备驱动程序提供了一个通用层。

[] ARCNET (Attached Resource Computer) is a LAN technology based on a token bus design (similar to 802.4) that has found its natural habitat in the industrial automation industry thanks to its deterministic performance. Linux provides a general layer for ARCNET and a few device drivers.

[ ] IrDA(红外数据协会)是红外无线通信的标准。

[] IrDA (Infrared Data Association) is a standard for infrared wireless communication.

[ * ]第 1 章中,您可以找到有关使用 VFT 的更多详细信息。

[*] In Chapter 1, you can find some more details on the use of VFTs.

[ * ] rtnl_unlock 是信号量原语 up 的包装器。当 up 被直接调用时（如在 rtnetlink_rcv 中），netdev_run_todo 会被显式调用。另请参阅“锁定”部分。

[*] rtnl_unlock is a wrapper around the semaphore primitive up. When up is called directly, as in rtnetlink_rcv, netdev_run_todo is called explicitly. See also the section "Locking."

[ * ] netif_ xxx _queue 例程在第 11 章中描述。

[*] netif_xxx_queue routines are described in Chapter 11.

[ * ]有关看门狗定时器的更多详细信息,请参阅第 11 章。

[*] See Chapter 11 for more details on the watchdog timer.

[ * ]有关用于启动、停止和重新启动出口队列的例程的更多详细信息,请参阅第 11 章。

[*] See Chapter 11 for more detail on the routines used to start, stop, and restart the egress queue.

[ * ]也可以使用其他例程来获取和释放信号量。有关更多详细信息,请参阅 include/linux/rtnetlink.h 。

[*] Other routines can also be used to acquire and release the semaphore. See include/linux/rtnetlink.h for more details.

第三部分。传输和接收

Part III. Transmission and Reception

这五章的目的是将所有可能影响内核内数据包路径的功能放在上下文中,并让您了解全局。您将看到每个子系统应该做什么以及它何时出现。本章不会涉及路由,因为路由有一个很大的章节,也不会涉及防火墙,这超出了本书的范围。

The aim of these five chapters is to put into context all the features that can influence the path of a packet inside the kernel, and to give you an idea of the big picture. You will see what each subsystem is supposed to do and when it comes into the picture. This chapter will not touch upon routing, which has a large chapter of its own, or firewalling, which is beyond the scope of this book.

在一般用法中,术语“传输”通常用于指任何方向的通信。但在内核讨论中,传输仅指向外发送帧,而接收指传入的帧。在某些地方,我使用术语“入口”表示接收,使用“出口”表示传输。

In general usage, the term transmission is often used to refer to communications in any direction. But in kernel discussions, transmission refers only to sending frames outward, whereas reception refers to frames coming in. In some places, I use the terms ingress for reception and egress for transmission.

转发的数据包(起源和终止于远程系统,但使用本地系统进行路由)构成了另一个结合了接收和传输元素的类别。第 10 章介绍了转发的一些方面;第五部分和第七部分进行了更彻底的讨论。

Forwarded packets—which both originate and terminate in remote systems but use the local system for routing—constitute yet another category that combines elements of reception and transmission. Some aspects of forwarding are presented in Chapter 10; a more thorough discussion appears in Parts V and VII.

我们在第一章中看到了术语“帧”“数据报”“数据包”之间的区别。由于第 III 部分中的章节讨论了 L2 和 L3 之间的接口,因此术语“帧”“数据包”在大多数情况下都是正确的。尽管我主要使用术语“帧” ,但有时在引用不引用任何特定层的数据单元时我可能会使用“数据包” 。packet这个词是我们正在讨论的代码中最常见的一个。

We saw in Chapter 1 the difference between the terms frame, datagram, and packet. Because the chapters in Part III discuss the interface between L2 and L3, both the terms frame and packet would be correct in most cases. Even though I'll mostly use the term frame, I may sometimes use packet when referring to a data unit with no reference to any particular layer. The word packet is the one most commonly seen in the code we are discussing.

以下是我们将在第三部分的每一章中看到的内容:

Here is what we will see in each chapter of Part III:

第 9 章 中断和网络驱动程序
Chapter 9 Interrupts and Network Drivers

在本章中，您将对下半部处理程序和内核同步机制有一个概述。

In this chapter, you will be given an overview on both bottom half handlers and kernel synchronization mechanisms.

第10章 帧接收
Chapter 10 Frame Reception

本章继续描述接收帧通过 L2 层的路径。

This chapter goes on to describe the path through the L2 layer of a received frame.

第十一章 帧传输
Chapter 11 Frame Transmission

第 11 章与第 10 章相同,但针对的是传输(传出)帧。

Chapter 11 does the same as Chapter 10, but for a transmitted (outgoing) frame.

第 12 章 关于中断的一般和参考资料
Chapter 12 General and Reference Material About Interrupts

这是前面章节的参考资料库。

This is a repository of reference material for the previous chapters.

第 13 章 协议处理程序
Chapter 13 Protocol Handlers

本章将通过讨论如何将入口帧传递给正确的 L3 协议接收例程来结束本书的这一部分。

This chapter will conclude this part of the book with a discussion of how ingress frames are handed to the right L3 protocol receive routines.

第 9 章中断和网络驱动程序

Chapter 9. Interrupts and Network Drivers

前面的章节概述了如何处理网络代码中核心组件的初始化。本书的其余部分逐个功能或逐个子系统地分析网络的实现方式、为何引入功能,以及在有意义时它们如何相互交互。

The previous chapters gave an overview of how the initialization of core components in the networking code is taken care of. The remainder of the book offers a feature-by-feature or subsystem-by-subsystem analysis of how networking is implemented, why features were introduced, and, when meaningful, how they interact with each other.

本章首先解释数据包如何在 L2 或驱动程序层与 IP 或网络层之间传输,详细信息将在第五部分中描述。我将大量引用第 2 章和第 8章中介绍的数据结构,因此您应该准备好根据需要返回这些章节。

This chapter begins an explanation of how packets travel between the L2 or driver layer and the IP or network layer described in detail in Part V. I'll be referring a lot to the data structures introduced in Chapters 2 and 8, so you should be ready to turn back to those chapters as needed.

即使在内核准备好处理来自或发往 L2 层的帧之前,它也必须处理微妙而复杂的中断系统设置,以便每秒处理数千帧成为可能。这就是本章的主题。

Even before the kernel is ready to handle the frame that is coming from or going to the L2 layer, it must deal with the subtle and complex system of interrupts set up to make the handling of thousands of frames per second possible. That is the subject of this chapter.

其他一些一般性问题影响本章的讨论:

A couple of other general issues affect the discussion in this chapter:

  • 当 Linux 内核编译为支持对称多处理 (SMP) 并在多处理器系统上运行时,用于接收和传输数据包的代码将充分利用该功能。所涉及的数据结构在设计时就考虑到了这一目标。在本章中,我们将特别关注 SMP 支持的一个方面:新的软中断队列和旧的积压队列之间的差异。

  • When the Linux kernel is compiled with support for symmetric multiprocessing (SMP) and runs on a multiprocessor system, the code for receiving and transmitting packets takes full advantage of that power. The data structures involved are designed with that goal in mind. In this chapter, we will look at one aspect of SMP support in particular: the differences between the new softirq queues and the old backlog queue.

  • 在谈论入口路径时,我将介绍大多数网络驱动程序仍在使用的旧接口和称为 NAPI 的新接口,它可以在中高负载下显着提高性能。

  • When talking about the ingress path, I will cover both the old interface, which is still used by most network drivers, and the new interface, called NAPI, which can significantly increase performance under medium to high loads.

在本章中，您将对下半部处理程序和内核同步机制有一个概述。不过，要进行更详细的讨论，您可以参考 O'Reilly 的另外两本书：《Understanding the Linux Kernel》和《Linux Device Drivers》。

In this chapter, you will be given an overview on both bottom half handlers and kernel synchronization mechanisms. However, for a more detailed discussion, you can refer to the other two O'Reilly books, Understanding the Linux Kernel and Linux Device Drivers.

决策和流量方向

Decisions and Traffic Direction

数据包通过网络堆栈所采取的路径，对于接收、传输和转发的数据包各不相同（参见图 9-1）。处理上的差异还取决于编译到内核中的功能以及它们的配置方式。最后，所涉及的设备也会造成差异，因为不同的设备支持不同的功能。

The paths taken by packets through the network stack differ for received, transmitted, and forwarded packets (see Figure 9-1). Differences in processing also depend on the features compiled into the kernel and how they are configured. Finally, the devices involved can make a difference because different devices support different features.

流量方向

图 9-1。流量方向

Figure 9-1. Traffic directions

虚拟设备,例如熟悉的环回接口 ( lo),倾向于使用网络堆栈内部的快捷方式。这些设备只是软件。例如,环回接口不与任何硬件关联,但绑定接口与一个或多个网卡间接关联。因此,某些虚拟接口可以消除硬件的一些限制(例如最大传输单元或 MTU),从而提高性能。

Virtual devices, such as the familiar loopback interface (lo), tend to use shortcuts inside the network stack. These devices are software only. For instance, the loopback interface is not associated with any piece of hardware, but bonding interfaces are associated indirectly with one or more network cards. Some virtual interfaces can therefore dispense with some of the limitations found with hardware (such as the Maximum Transmission Unit, or MTU) and thus speed up performance.

图 9-2给出了总体情况。这当然是非常粗略的;例如,它不会显示可能导致丢帧的所有条件。[ * ]该图包含有关入口路径的额外详细信息;您可以在第五部分、第六部分和第七部分中找到有关出口路径的更详细图表。我们将在本章的其余部分中浏览所有应成为图表一部分的链接。

Figure 9-2 gives an idea of the big picture. It is certainly very sketchy; for instance, it does not show all of the conditions that can lead to dropping a frame.[*] The figure includes extra details about the ingress path; you can find more detailed graphs about the egress path in Parts V, VI, and VII. We will go through all the links that should be part of the graph in the rest of this chapter.

收到帧时通知驱动程序

Notifying Drivers When Frames Are Received

第 5 章中,我提到设备和内核可以使用两种主要技术来交换数据:轮询和中断。我还说过,两者结合也是一个有效的选择。本节简要概述了驱动程序通知内核有关帧接收的最常见方法,以及每种方法的主要优缺点。有些方法取决于设备上特定功能的可用性(例如临时计时器),有些方法需要对驱动程序、操作系统或两者进行更改。

In Chapter 5, I mentioned that devices and the kernel can use two main techniques for exchanging data: polling and interrupts. I also said that a combination of the two is also a valid option. This section offers a brief overview of the most common ways for a driver to notify the kernel about the reception of a frame, along with the main pros and cons for each one. Some approaches depend on the availability of specific features on the devices (such as ad hoc timers), and some need changes to the driver, the operating system, or both.

入口路径(帧接收)

图 9-2。入口路径(帧接收)

Figure 9-2. Ingress path (frame reception)

此讨论理论上适用于任何设备类型,但它最好地描述了那些可以生成大量交互(即帧的接收)的设备,例如网卡。

This discussion could theoretically apply to any device type, but it best describes those devices like network cards that can generate a high number of interactions (that is, the reception of frames).

轮询

Polling

通过这种技术，内核不断检查设备是否有话要说。例如，它可以通过连续读取设备上的内存寄存器来做到这一点，或者在定时器到期时回头检查它。可以想象，这种方法很容易浪费相当多的系统资源，如果操作系统和设备可以使用中断等其他技术，则很少采用。尽管如此，在某些情况下轮询仍是最好的方法。我们稍后会回到这一点。

With this technique, the kernel constantly keeps checking whether the device has anything to say. It can do that by continually reading a memory register on the device, for instance, or returning to check it when a timer expires. As you can imagine, this approach can easily waste quite a lot of system resources, and is rarely employed if the operating system and device can use other techniques such as interrupts. Still, there are cases where polling is the best approach. We will come back to this point later.

中断

Interrupts

这里,设备驱动程序代表内核,指示设备在特定事件发生时生成硬件中断。内核从其他活动中断后,将调用驱动程序注册的处理程序来满足设备的需求。当事件是帧的接收时,处理程序将帧排队到某处并通知内核。这种技术非常常见,仍然是低流量负载下的最佳选择。不幸的是,它在高流量负载下表现不佳:为接收到的每个帧强制中断很容易使 CPU 浪费所有处理中断的时间。

Here the device driver, on behalf of the kernel, instructs the device to generate a hardware interrupt when specific events occur. The kernel, interrupted from its other activities, will then invoke a handler registered by the driver to take care of the device's needs. When the event is the reception of a frame, the handler queues the frame somewhere and notifies the kernel about it. This technique, which is quite common, still represents the best option under low traffic loads. Unfortunately, it does not perform well under high traffic loads: forcing an interrupt for each frame received can easily make the CPU waste all of its time handling interrupts.

处理输入帧的代码分为两部分:首先驱动程序将帧复制到内核可访问的输入队列中,然后内核对其进行处理(通常将其传递给专用于相关协议的处理程序,例如知识产权)。第一部分在中断上下文中执行,可以抢占第二部分的执行。这意味着接受输入帧并将其复制到队列中的代码比实际处理帧的代码具有更高的优先级。

The code that takes care of an input frame is split into two parts: first the driver copies the frame into an input queue accessible by the kernel, and then the kernel processes it (usually passing it to a handler dedicated to the associated protocol such as IP). The first part is executed in interrupt context and can preempt the execution of the second part. This means that the code that accepts input frames and copies them into the queue has higher priority than the code that actually processes the frames.

在高流量负载下,中断代码将继续抢占处理代码。结果是显而易见的:在某个时刻,输入队列将满,但是由于应该使这些帧出队并处理这些帧的代码由于其优先级较低而没有机会运行,因此系统崩溃了。新帧无法排队,因为没有空间,旧帧也无法处理,因为没有可用的 CPU。这种情况在文献中称为接收活锁。

Under a high traffic load, the interrupt code would keep preempting the processing code. The consequence is obvious: at some point the input queue will be full, but since the code that is supposed to dequeue and process those frames does not have a chance to run due to its lower priority, the system collapses. New frames cannot be queued since there is no space, and old frames cannot be processed because there is no CPU available for them. This condition is called receive-livelock in the literature.
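The collapse can be illustrated with a toy discrete-time simulation; the `simulate` model and all its numbers are invented for this example, not measured behavior. As long as interrupt handling leaves some CPU over, frames get processed; once the arrival rate consumes the whole budget, the queue fills and every new frame is dropped:

```c
/* Toy model of receive-livelock: in each tick the interrupt handler
 * enqueues arrival_rate frames (dropping when the queue is full), and
 * the lower-priority processing code may dequeue only with whatever
 * CPU the interrupts leave over. */
#define QUEUE_CAP 100

struct livelock_stats {
    long processed;
    long dropped;
};

static struct livelock_stats simulate(int arrival_rate, int cpu_budget,
                                      int cost_per_irq, int ticks)
{
    struct livelock_stats s = {0, 0};
    int qlen = 0;

    for (int t = 0; t < ticks; t++) {
        /* Interrupt context: enqueue, preempting everything else. */
        for (int i = 0; i < arrival_rate; i++) {
            if (qlen < QUEUE_CAP)
                qlen++;
            else
                s.dropped++;   /* queue full, frame lost */
        }
        /* Leftover CPU after interrupt handling processes frames;
         * under overload this reaches zero and nothing is dequeued. */
        int left = cpu_budget - arrival_rate * cost_per_irq;
        while (left > 0 && qlen > 0) {
            qlen--;
            s.processed++;
            left--;
        }
    }
    return s;
}
```

At a low arrival rate everything is processed and nothing is dropped; when interrupt cost alone equals the CPU budget, throughput falls to zero even though the machine is fully busy, which is the livelock.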

总之,该技术的优点是帧的接收和处理之间的延迟非常低,但在高负载下效果不佳。大多数网络驱动程序都使用中断,本章后面的大部分内容将讨论它们是如何工作的。

In summary, this technique has the advantage of very low latency between the reception of the frame and its processing, but does not work well under high loads. Most network drivers use interrupts, and a large section later in this chapter will discuss how they work.

在中断期间处理多个帧

Processing Multiple Frames During an Interrupt

许多 Linux 设备驱动程序都使用这种方法。当通知中断并执行驱动程序处理程序时,后者会继续下载帧并将它们排队到内核输入队列,直到最大帧数(或时间窗口)。当然,可以继续这样做,直到队列变空,但让我们记住设备驱动程序应该表现得像好公民。它们必须与其他子系统共享 CPU,并与其他设备共享 IRQ 线。礼貌行为尤其重要,因为在驱动程序处理程序运行时中断被禁用。

This approach is used by quite a few Linux device drivers. When an interrupt is notified and the driver handler is executed, the latter keeps downloading frames and queuing them to the kernel input queue, up to a maximum number of frames (or a window of time). Of course, it would be possible to keep doing that until the queue gets empty, but let's remember that device drivers should behave as good citizens. They have to share the CPU with other subsystems and IRQ lines with other devices. Polite behavior is especially important because interrupts are disabled while the driver handler is running.

正如上一节中所做的那样,存储限制也适用。每个设备的内存量有限,因此它可以存储的帧数也有限。如果驱动程序没有及时处理它们,缓冲区可能会满,并且新帧(或旧帧,取决于驱动程序策略)可能会被丢弃。如果加载的设备继续处理传入帧直到其队列清空,则其他设备可能会发生这种形式的饥饿。

Storage limitations also apply, as they did in the previous section. Each device has a limited amount of memory, and therefore the number of frames it can store is limited. If the driver does not process them in a timely manner, the buffers can get full and new frames (or old ones, depending on the driver policies) could be dropped. If a loaded device kept processing incoming frames until its queue emptied out, this form of starvation could happen to other devices.

This technique does not require any change to the operating system; it is implemented entirely within the device driver.

There could be other variations to this approach. Instead of keeping all interrupts disabled and having the driver queue frames for the kernel to handle, a driver could disable interrupts only for a device that has frames in its ingress queue and delegate the task of polling the driver's queue to a kernel handler. This is exactly what Linux does with its new interface, NAPI. However, unlike the approach described in this section, NAPI requires changes to the kernel.

Timer-Driven Interrupts

This technique is an enhancement to the previous ones. Instead of having the device asynchronously notify the driver about frame receptions, the driver instructs the device to generate an interrupt at regular intervals. The handler will then check if any frames have arrived since the previous interrupt, and handles all of them in one shot. Even better would be to have the driver generate interrupts at intervals, but only if it has something to say.

Based on the granularity of the timer (which is implemented in hardware by the device itself; it is not a kernel timer), the frames that are received by the device will experience different levels of latency. For instance, if the device generated an interrupt every 100 ms, the notification of the reception of a frame would have an average delay of 50 ms and a maximum one of 100 ms. This delay may or may not be acceptable depending on the applications running on top of the network connections using the device.[*]

The granularity available to a driver depends on what the device has to offer, since the timer is implemented in hardware. Only a few devices provide this capability currently, so this solution is not available for all the drivers in the Linux kernel. One could simulate that capability by disabling interrupts for the device and using a kernel timer instead. However, one would not have the support of the hardware, and the CPU cannot spend as much of its resources as the device can on handling timers, so one would not be able to schedule the timers nearly as often. This workaround would, in the end, become a polling approach.

Combinations

Each approach described in the previous sections has some advantages and disadvantages. Sometimes, it is possible to combine them and obtain something even better. We said that under low load, the pure interrupt model guarantees a low latency, but that under high load it performs terribly. On the other hand, the timer-driven interrupt may introduce too much latency and waste too much CPU time under low load, but it helps a lot in reducing the CPU usage and solving the receive-livelock problem under high load. A good combination would use the interrupt technique under low load and switch to the timer-driven interrupt under high load. The tulip driver included in the Linux kernel, for instance, can do this (see drivers/net/tulip/interrupt.c [*]).

Example

A balanced approach to processing multiple frames is shown in the following piece of code, taken from the drivers/net/3c59x.c Ethernet driver. It is a selection of key lines from vortex_interrupt, the function registered by the driver as the handler of interrupts from devices in 3Com's Vortex family:

static irqreturn_t vortex_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    int work_done = max_interrupt_work;
    ioaddr = dev->base_addr;
    ... ... ...
    status = inw(ioaddr + EL3_STATUS);
    do {
        ... ... ...
        if (status & RxComplete)
            vortex_rx(dev);
        if (--work_done < 0) {
            /* Disable all pending interrupts. */
            ... ... ...
            /* The timer will re-enable interrupts. */
            mod_timer(&vp->timer, jiffies + 1*HZ);
            break;
        }
        ... ... ...
    } while ((status = inw(ioaddr + EL3_STATUS)) & (IntLatch | RxComplete));
    ... ... ...
}

Other drivers that follow the same model will have something very similar. They probably will call the EL3_STATUS and RxComplete symbols something different, and their implementation of an xxx_rx function may be different, but the skeleton will be very close to the one shown here.

In vortex_interrupt, the driver reads from the device the reasons for the interrupt and stores them into status. Network devices can generate an interrupt for different reasons, and several reasons can be grouped together in a single interrupt. If RxComplete (a symbol specially defined by this driver to mean a new frame has been received) is among those reasons, the code invokes vortex_rx.[*] During its execution, interrupts are disabled for the device. However, the driver can read a hardware register on the card and find out if in the meantime, a new interrupt was posted. The IntLatch flag is true when a new interrupt has been posted (and it is cleared by the driver when it is done processing it).

vortex_interrupt keeps processing incoming frames until the register says there is an interrupt pending (IntLatch) and that it is due to the reception of a frame (RxComplete). This also means that only multiple occurrences of RxComplete interrupts can be handled in one shot. Other types of interrupts, which are much less frequent, can wait.

Finally—here is where good citizenship enters—the loop terminates if it reaches the maximum number of input frames that can be processed, stored in work_done. This driver uses a default value of 32 and allows that value to be tuned at module load time.

Interrupt Handlers

A good deal of the frame handling we discuss in this chapter takes place in response to interrupts from network hardware. The scheduling of functions triggered by interrupts is a complicated topic and deserves some study, even though it doesn't concern networking in particular. Therefore, in this section, we discuss the various ways that interrupts are handled by different network drivers and introduce the concepts of bottom halves and softirqs.

In Chapter 5, we saw how device drivers register their handlers with an IRQ number, but we did not see how hardware interrupts delegate frame processing to software interrupt handlers. This section will describe how an interrupt request associated with the reception of a frame is handled all the way to the point where protocol handlers discussed in Chapter 13 receive their packets. We will see the relationship between hardware IRQs and software IRQs and why the latter category is needed. We will briefly see how interrupts were handled with the old kernels and then compare the old approach to the new one introduced with kernel version 2.4. This discussion will show the advantages of the new model over the old one, especially in the area of performance.

Before launching into softirqs, we need a small introduction to the concept of bottom half handlers. However, I will not go into much detail about them because they are documented in other resources, notably Understanding the Linux Kernel and Linux Device Drivers.

Reasons for Bottom Half Handlers

Whenever a CPU receives an interrupt notification, it invokes the handler associated with that interrupt, which is identified by a number. During the handler's execution, in which the kernel code is said to be in interrupt context, interrupts are disabled for the CPU serving the interrupt. This means that if a CPU is busy serving one interrupt, it cannot receive other interrupts, whether of the same type or of different types.[*] Nor can the CPU execute any other process: it belongs totally to the interrupt handler and cannot be preempted.

In the simplest situation, these are the main events touched off by an interrupt:

  1. The device generates an interrupt and the hardware notifies the kernel.

  2. If the kernel is not serving another interrupt (and if interrupts are not disabled for other reasons), it will see the notification.

  3. The kernel disables interrupts for the local CPU and executes the handler associated with the interrupt type received.

  4. The kernel exits the interrupt handler and re-enables interrupts for the local CPU.

In short, interrupt handlers are nonpreemptible and non-reentrant. (A function is defined as non-reentrant when it cannot be interrupted by another invocation of itself. In the case of interrupt handlers, it simply means that they are executed with interrupts disabled.) This design choice helps reduce the likelihood of race conditions. However, because the CPU is so limited in what it can do, the nonpreemptible design has potentially serious effects on performance by the kernel as well as the processes waiting to be served by the CPU.

Therefore, the work done by interrupt handlers should be as quick as possible. The amount of processing needed by the interrupt handlers during interrupt context depends on the type of event. A keyboard, for instance, may simply send an interrupt every time a key is pressed, which requires very little effort to be handled: the handler simply needs to store the code of the key somewhere, and run a few times per second at most. At other times, the actions required to handle an interrupt are not trivial and their executions could require much CPU time. Network devices, for instance, have a relatively complex job: they need to allocate a buffer (sk_buff), copy the received data into it, initialize a few parameters within the buffer structure (protocol) to tell the higher-layer protocol handlers what kind of data is coming from the driver, and so on.

Here is where the concept of a bottom half handler comes into play. Even if the action triggered by an interrupt needs a lot of CPU time, most of this action can usually wait. Interrupts are allowed to preempt the CPU in the first place because if the operating system makes the hardware wait too long, it may lose data. This is obviously true of real-time streaming data, but also is true of any hardware that has to store incoming data in fixed-size buffers. And if the hardware loses data, there is usually no way to get it back.

On the other hand, if the kernel or a user-space process has to be delayed or preempted, no data will be lost (with the exception of real-time systems, which entail a completely different way of handling processes as well as interrupts). In light of these considerations, modern interrupt handlers are divided into a top half and a bottom half. The top half consists of everything that has to be executed before releasing the CPU, to preserve data. The bottom half contains everything that can be done at relative leisure.

One can define a bottom half as an asynchronous request to execute a particular function. Normally, when you want to execute a function, you do not have to request anything—you simply invoke it. When an interrupt arrives, you have a lot to do and don't want to do it right away. Thus, you package most of the work into a function that you submit as a bottom half.

The following model allows the kernel to keep interrupts disabled for much less time than the simple model shown previously:

  1. The device signals the CPU to notify it of the interrupt.

  2. The CPU executes the associated top half, disabling further interrupt notifications until this handler has finished its job.

  3. Typically, a top half performs the following:

    1. It saves somewhere in RAM all the information that the kernel will need later to process the interrupt event.

    2. It marks a flag somewhere (or triggers something using another kernel mechanism) to make sure the kernel will know about the interrupt and will use the data saved by the handler to complete the event processing.

    3. Before terminating, it re-enables the interrupt notifications for the local CPU.

  4. At some later point, when the kernel is free of more pressing matters, it checks the flag set by the interrupt handler (signaling the presence of data to be processed) and calls the associated bottom half handler. It also clears the flag so that it can later recognize when the interrupt handler sets the flag again.

Over time, Linux developers have tried different types of bottom halves, which obey different rules. Networking has played a large role in the development of new implementations, because of networking's need for low latency—that is, a minimal amount of time between the reception of a frame and its delivery. Low latency is more important for network device drivers than for other types of devices because of the high number of tasks involved in reception and transmission. As described earlier in the section "Interrupts," it can be disastrous to let a large number of frames build up while waiting to be handled. Sound cards are another example of devices requiring fast response.

Bottom Halves Solutions

The kernel provides different mechanisms for implementing bottom halves and for deferring work in general. These mechanisms differ mainly with regard to the following points:

Running context

Interrupts are seen by the kernel as having a different running context from user-space processes or other kernel code. When the function executed by a bottom half is capable of going to sleep, it is restricted to mechanisms allowed in process context, as opposed to interrupt context.

Concurrency and locking

When a mechanism can take advantage of SMP, this has implications for how serialization is enforced (if necessary) and how locking influences scalability.

In this chapter, we will look only at those mechanisms that do not need a process context—namely, softirqs and tasklets. In the next section, we will briefly see their implications for concurrency and locking.

When you need to defer the execution of a function that may sleep, you need to use a dedicated kernel thread or work queues. A work queue is simply a queue where you can queue a request to execute a function, and a kernel thread will take care of it. In this case, the function would be executed in the context of a kernel thread, and therefore sleeping is allowed. Since the networking code mainly uses softirqs and tasklets, we will not look at work queues.

Concurrency and Locking

Before launching into the code that network drivers use to handle bottom halves, we need some background on concurrency, which refers to functions that can interfere with each other either because they are scheduled on different CPUs or because one is suspended by the kernel to run another. Related topics are locks and the disabling of interrupts. (Concurrency is discussed in detail in both Understanding the Linux Kernel and Linux Device Drivers.)

Three different types of functions will be introduced in this chapter to handle interrupts: old-style bottom halves, softirqs, and tasklets. All of them can be used to schedule the execution of a function, but they come with some big differences. As far as concurrency is concerned, we can summarize the differences as follows:

  • Only one old-style bottom half can run at any time, regardless of the number of CPUs (kernel 2.2).

  • Only one instance of each tasklet can run at any time. Different tasklets can run concurrently on different CPUs. This means that given any tasklet, there is no need to enforce any serialization because it is already enforced by the kernel: you cannot have multiple instances of the same tasklet running concurrently.

  • Only one instance of each softirq can run at the same time on a CPU. However, the same softirq can run on different CPUs concurrently. This means that given any softirq, you need to make sure that accesses to shared data by different CPUs use proper locking. To increase parallelization, the softirqs should be designed to access only per-CPU data as much as possible, reducing the need for locking considerably.

Therefore, these three features require different kinds of locking mechanisms. The higher the concurrency allowed, the more carefully the programmer has to design the code executed, for the sake of both accuracy and performance. Whether a softirq or a tasklet represents the best choice for any given context depends on both locking and concurrency requirements. In most cases, tasklets are the way to go. But given the tight response requirements of the receive and transmit networking tasks, softirqs are preferred in those two specific cases. We will see later in this chapter how the networking code uses softirqs.

In some cases, the programmer has to disable hardware interrupts, software interrupts, or both. A detailed discussion of the contexts requires a background in SMP, preemption in the Linux kernel, and other matters outside the scope of this book. However, to understand the networking code you need to know the meaning of the main functions used to enable and disable interrupts. Table 9-1 summarizes the ones we need in this chapter (you can find many more in kernel/softirq.c, include/asm-XXX/hardirq.h, include/asm-XXX/spinlock.h, and include/linux/spinlock.h). Some of them may be defined globally and others per architecture.

Table 9-1. A few APIs related to software and hardware interrupts

in_interrupt
Returns TRUE if the CPU is currently serving a hardware or software interrupt, or preemption is disabled.

in_softirq
Returns TRUE if the CPU is currently serving a software interrupt.

in_irq
Returns TRUE if the CPU is currently serving a hardware interrupt.

In the section "Preemption," and with the help of Figure 9-3, you can see how these three routines are implemented.

softirq_pending
Returns TRUE if there is at least one softirq pending (i.e., scheduled for execution) for the CPU whose ID was passed as the input argument.

local_softirq_pending
Returns TRUE if there is at least one softirq pending for the local CPU.

__raise_softirq_irqoff
Sets the flag associated with the input softirq type to mark it pending.

raise_softirq_irqoff
A wrapper around __raise_softirq_irqoff that also wakes up ksoftirqd when in_interrupt() returns FALSE.

raise_softirq
A wrapper around raise_softirq_irqoff that disables hardware interrupts before calling it and restores them to their original status.

__local_bh_enable, local_bh_enable, local_bh_disable
__local_bh_enable enables bottom halves (and thus softirqs/tasklets) on the local CPU, and local_bh_enable also invokes invoke_softirq if any softirq is pending and in_interrupt() returns FALSE. local_bh_disable disables bottom halves on the local CPU.

local_irq_disable, local_irq_enable
Disable and enable interrupts on the local CPU.

local_irq_save, local_irq_restore
local_irq_save first saves the current state of interrupts on the local CPU and then disables them. local_irq_restore restores the state of interrupts on the local CPU thanks to the information previously saved with local_irq_save.

spin_lock_bh, spin_unlock_bh
Acquire and release a spinlock, respectively. Both functions disable and then re-enable bottom halves and preemption during the operation.

Preemption

In time-sharing systems, the kernel has always been able to preempt user processes at will, but the kernel itself is often nonpreemptive, which means that once it starts running it will not be interrupted until it is ready to give up control. A nonpreemptive kernel sometimes holds up high-priority processes when they are ready to run because the kernel is executing a system call for a lower-priority process. To support real-time extensions and for other reasons, the Linux kernel was made fully preemptible during the 2.5 kernel development cycle. With this new kernel feature, system calls and other kernel tasks can be preempted by other kernel tasks with higher priorities.

Because much work had already been done to eliminate critical sections (nonpreemptible code) from the kernel to support SMP locking mechanisms, adding full preemption was not a major change to the kernel. Once preemption was added, developers just had to define explicitly where to disable it (in hardware and software interrupt code, in the scheduler itself, in the code protected by spin locks and read/write locks, etc.).

However, there are times when preemption, just like interrupts, must be disabled. In this section, I'll cover just a few functions related to preemption that you may bump into while browsing the code, and then briefly show how some of the locking macros have been updated to deal with preemption.

The following functions control preemption:

preempt_disable

Disables preemption for the current task. Can be called repeatedly, incrementing a reference counter.

preempt_enable

preempt_enable_no_resched

The reverse of preempt_disable, allowing preemption to be enabled again. preempt_enable_no_resched simply decrements a reference counter, which allows preemption to be re-enabled when it reaches zero. preempt_enable, in addition, checks whether the counter is zero and forces a call to schedule() to allow any higher-priority task to run.

preempt_check_resched

This function is called by preempt_enable and differentiates it from preempt_enable_no_resched.

The networking code does not deal with these routines directly. However, preempt_enable and preempt_disable are indirectly called, for instance, by locking primitives, like rcu_read_lock and rcu_read_unlock, spin_lock and spin_unlock, etc. Routines used to access per-CPU data structures, like get_cpu and get_cpu_var, also disable preemption before reading the data.

A counter for each process, named preempt_count and embedded in the thread_info structure, indicates whether a given process allows preemption. The field can be read with preempt_count() and is manipulated indirectly through the inc_preempt_count and dec_preempt_count functions defined in include/linux/preempt.h. There are situations in which the kernel should not be preempted. These include when it is servicing hardware, as well as when it uses one of the calls just shown to disable preemption. Therefore, preempt_count is split into three components. Each byte is a counter for a different condition that requires nonpreemption: hardware interrupts, software interrupts, and general nonpreemption. The layout of preempt_count is shown in Figure 9-3.

Figure 9-3. Structure of preempt_count

The figure shows, in addition to the purpose of each byte, the main functions that manipulate it. The high-order byte is not fully used at the moment, but its second least significant bit is set before calling the schedule function and tells that function that it has been called to preempt the current task.[*] In include/asm-xxx/hardirq.h you can find several macros that make it easier to read and write preempt_count; some of these include the XXX_OFFSET variables shown in Figure 9-3 and used by the functions listed in the figure to increment or decrement the right byte.

Despite all this complexity, whenever a check has to be done on the current process to see if it can be preempted, all the kernel needs to know is whether preempt_count is NULL (it does not really matter why preemption is disabled).

Bottom-Half Handlers

The infrastructure for bottom halves must address the following needs:

  • Classifying the bottom half as the proper type

  • Registering the association between a bottom half type and its handler

  • Scheduling a bottom half for execution

  • Notifying the kernel about the presence of scheduled BHs

Let's first see how kernels up to version 2.2 handled bottom half handlers , and then how they are handled with the softirqs used by kernels 2.4 and 2.6.

Bottom-half handlers in kernel 2.2

The 2.2 model for bottom-half handlers divides them into a large number of types, which are differentiated by when and how often the kernel checks for them and runs them. The 2.2 list is as follows, taken from include/linux/interrupt.h. In this book, we are most interested in NET_BH.

enum {
        TIMER_BH = 0,
        CONSOLE_BH,
        TQUEUE_BH,
        DIGI_BH,
        SERIAL_BH,
        RISCOM8_BH,
        SPECIALIX_BH,
        AURORA_BH,
        ESP_BH,
        NET_BH,
        SCSI_BH,
        IMMEDIATE_BH,
        KEYBOARD_BH,
        CYCLADES_BH,
        CM206_BH,
        JS_BH,
        MACSERIAL_BH,
        ISICOM_BH
};

Each bottom-half type is associated with a function handler by means of init_bh. The networking code, for instance, initializes the NET_BH bottom-half type to the net_bh handler in net_dev_init, which is covered in Chapter 5.

__initfunc(int net_dev_init(void))
{
        ... ... ...
        init_bh(NET_BH, net_bh);
        ... ... ...
}

The main function used to unregister a BH handler is remove_bh. (There are other related functions too, such as enable_bh/disable_bh, but we do not need to see all of them.)

Whenever an interrupt handler wants to trigger the execution of a bottom half handler, it has to set the corresponding flag with mark_bh. This function is very simple: it sets a bit into a global bitmap bh_active, which, as we will see in a moment, is tested in several places.

extern inline void mark_bh(int nr)
 {
        set_bit(nr, &bh_active);
};

For instance, you will see later in the chapter that every time a network device driver has successfully received a frame, it signals the kernel about it with a call to netif_rx. The latter queues the newly received frame into the ingress queue backlog (shared by all the CPUs) and marks the NET_BH bottom-half handler flag.

skb_queue_tail(&backlog, skb);
mark_bh(NET_BH);
return;

During several routine operations, the kernel checks whether any bottom halves are scheduled for execution. If any are waiting, the kernel runs the function do_bottom_half (currently in kernel/softirq.c), to execute them. The checks are performed during:

do_IRQ

Whenever the kernel is notified by an IRQ about a hardware interrupt, it calls do_IRQ to execute the associated handler. Since a good number of bottom halves are scheduled for execution by interrupt handlers, what could give them less latency than an invocation right at the end of do_IRQ? For this reason, the regular timer interrupt that expires with frequency HZ represents an upper bound between two consecutive executions of do_bottom_half.

Returns from interrupts and exceptions (which includes system calls)

See arch/XXX/kernel/entry.S for the assembly language code that takes care of this case.

schedule

This function, which decides what to execute next on the CPU, checks if any bottom-half handlers are pending and gives them higher priority over other tasks.

asmlinkage void schedule(void)
{

        /* Do "administrative" work here while we don't hold any locks */
        if (bh_mask & bh_active)
                goto handle_bh;
handle_bh_back:
        ... ... ...
handle_bh:
        do_bottom_half( );
        goto handle_bh_back;
        ... ... ...
}

run_bottom_half, the function used by do_bottom_half to execute the pending interrupt handlers, looks like this:

        active = get_active_bhs( );
        clear_active_bhs(active);
        bh = bh_base;
        do {
                if (active & 1)
                        (*bh)( );
                bh++;
                active >>= 1;
        } while (active);

The order in which the pending handlers are invoked depends on the positions of the associated flags inside the bitmap and the direction used to scan those flags (returned by get_active_bhs). In other words, bottom halves are not run on a first-come-first-served basis. And since networking bottom halves can take a long time, those that have the misfortune to be dequeued last can experience high latency.

Bottom halves in 2.2 and earlier kernels suffer from a ban on concurrency. Only one bottom half can run at any time, regardless of the number of CPUs.

Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq

The biggest improvement between kernels 2.2 and 2.4, as far as interrupt handling is concerned, was the introduction of software interrupts (softirqs), which can be seen as the multithreaded version of bottom half handlers . Not only can many softirqs run concurrently, but also the same softirq can run on different CPUs concurrently. The only restriction on concurrency is that only one instance of each softirq can run at the same time on a CPU.

The new softirq model has only six types (from include/linux/interrupt.h):

enum
{
    HI_SOFTIRQ=0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    SCSI_SOFTIRQ,
    TASKLET_SOFTIRQ
};

All the XXX_BH bottom-half types in the old model are still available to old drivers, but have been reimplemented to run as softirqs of the HI_SOFTIRQ type (which means they take priority over the other softirq types). The two types used by networking code, NET_TX_SOFTIRQ and NET_RX_SOFTIRQ, are introduced in the later section "How the Networking Code Uses softirqs." The next section will introduce tasklets.

Softirqs, like the old bottom halves, run with interrupts enabled and therefore can be suspended at any time to handle a new, incoming interrupt. However, the kernel does not allow a new request for a softirq to run on a CPU if another instance of that softirq has been suspended on that CPU; this drastically reduces the amount of locking needed. Each softirq type can maintain an array of data structures of type softnet_data, one per CPU, to hold state information about the current softirq; we'll see the contents of this structure in the section "softnet_data Structure." Since different instances of the same type of softirq can run simultaneously on different CPUs, the functions run by softirqs still need to lock other data structures that are shared, to avoid race conditions.

The functions used to register and schedule a softirq handler, and the logic behind them, are very similar to the ones used with 2.2 bottom halves.

softirq handlers are registered with the open_softirq function, which, unlike init_bh, accepts an extra parameter so that the function handler can be passed some input data if needed. None of the softirqs, however, currently uses that extra parameter, and a proposal has been floated to remove it. open_softirq simply copies the input parameters into the global array softirq_vec, declared in kernel/softirq.c, which holds the associations between types and handlers.

static struct softirq_action softirq_vec[32] _ _cacheline_aligned_in_smp;

void open_softirq(int nr, void (*action)(struct softirq_action*), void *data)
{
    softirq_vec[nr].data = data;
    softirq_vec[nr].action = action;
}

A softirq can be scheduled for execution on the local CPU by the following functions:

__raise_softirq_irqoff

This function, the counterpart of mark_bh in 2.2, simply sets the bit flag associated with the softirq to be run. Later on, when the flag is checked, the associated handler will be invoked.

raise_softirq_irqoff

This is a wrapper around __cpu_raise_softirq that additionally schedules the ksoftirqd thread (discussed later in this chapter) if the function is not called from a hardware or software interrupt context and preemption has not been disabled. If the function is called from interrupt context, invoking the thread is not necessary because, as we will see, do_softirq will be called anyway.

raise_softirq

This is a wrapper around raise_softirq_irqoff that executes the latter with hardware interrupts disabled.

The following code, taken from kernel 2.4.5,[*] shows the model used at an early stage of softirq development. It is very similar to the 2.2 model, and invokes the function do_softirq, which is a counterpart to the 2.2 function do_bottom_half discussed in the previous section. do_softirq is called if at least one softirq has been scheduled for execution:

asmlinkage void schedule(void)
{

        /* Do "administrative" work here while we don't hold any locks */
        if (softirq_active(this_cpu) & softirq_mask(this_cpu))
                goto handle_softirq;
handle_softirq_back:
        ... ... ...
handle_softirq:
        do_softirq( );
        goto handle_softirq_back;
        ... ... ...
}

The only difference between this early stage of softirqs and the 2.2 bottom-half model is that the softirq version has to check the flags on a per-CPU basis, since each CPU has its own bitmap of pending softirqs.

The implementation of do_softirq is also very similar to its counterpart do_bottom_half in 2.2. The kernel also calls the function at some of the same points, but not entirely the same. The main change is the introduction of a new per-CPU kernel thread, ksoftirqd.

Here are the main points where do_softirq may be invoked:[*]

do_IRQ

The skeleton for do_IRQ, which is defined in the per-architecture files arch/arch-name/kernel/irq.c, is:

fastcall unsigned int do_IRQ(struct pt_regs * regs)
{
    irq_enter( );
    ... ... ...
    /* handle the IRQ number "irq" with the registered handler */
    ... ... ...
    irq_exit( );
    return 1;
}

In kernel 2.4, the function also called do_softirq. For most architectures in 2.6, a call to do_softirq is made inside irq_exit instead. A minority still have it inside do_IRQ.

Since nested calls to irq_enter are allowed, irq_exit calls invoke_softirq only when all the usual conditions are met (there are softirqs pending, the kernel is not in interrupt context, etc.) and the reference count associated with the interrupt context has reached zero, indicating that the kernel is leaving the interrupt context.

Here is the generic definition of irq_exit from kernel/softirq.c, but there are architectures that define their own versions:

void irq_exit(void)
{
    ...
    sub_preempt_count(IRQ_EXIT_OFFSET);
    if (!in_interrupt( ) && local_softirq_pending( ))
        invoke_softirq( );
    preempt_enable_no_resched( );
}

smp_apic_timer_interrupt, which handles SMP timers in arch/XXX/kernel/apic.c, also uses irq_enter/irq_exit.

Returns from interrupts and exceptions (which include system calls)

This is the same as kernel 2.2.

local_bh_enable

When softirqs are re-enabled on a CPU, pending requests are processed (if any) with a call to do_softirq.

The kernel threads, ksoftirqd_CPUn

To keep softirqs from monopolizing all the CPUs (which could happen easily on a heavily loaded network because the NET_TX_SOFTIRQ and NET_RX_SOFTIRQ interrupts have a higher priority than user processes), developers introduced a new set of per-CPU threads. These have the names ksoftirqd_CPU0, ksoftirqd_CPU1, and so on, and can be seen by a ps command. More details on these threads appear in the section "ksoftirqd Kernel Threads."

I have described i386 behavior in general; other architectures may use different naming conventions or have additional timers that also invoke do_softirq.

Another interesting place where do_softirq is called is from within netif_rx_ni, which is briefly described in the section "Old Interface Between Device Drivers and Kernel: First Part of netif_rx" in Chapter 10. The traffic generator built into the kernel (net/core/pktgen.c) also calls do_softirq.

Tasklets

Most of the bottom halves of the 2.2 kernel variety have been converted to either softirqs or tasklets . A tasklet is a function that some interrupt or other task has deferred to execute later. Tasklets are built on top of softirqs and are usually kicked off by interrupt handlers. (But other parts of the kernel, such as the neighboring subsystem discussed in Part VI, also use tasklets).[*]

In the section "Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq," we saw the list of softirqs. HI_SOFTIRQ is used to implement high-priority tasklets, and TASKLET_SOFTIRQ is used for lower-priority ones. Each time a request for a deferred execution is issued, an instance of a tasklet_struct structure is queued onto either a list processed by HI_SOFTIRQ or another one that is instead processed by TASKLET_SOFTIRQ.

Since softirqs are handled independently by each CPU, it should not be a surprise that there are two lists of pending tasklet_structs for each CPU, one associated with HI_SOFTIRQ and one with TASKLET_SOFTIRQ. These are their definitions from kernel/softirq.c:

static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec) = { NULL };
static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec) = { NULL };

At first sight, tasklets may seem to be just like the old bottom halves, but there actually are substantial differences:

  • There is no limit on the number of different tasklets, whereas the old bottom halves were limited to one type for each bit flag of bh_base.

  • Tasklets provide two levels of priority.

  • Different tasklets can run concurrently on different CPUs.

  • Tasklets, unlike old bottom halves and softirqs, are dynamic and do not need to be statically declared in an XXX_BH or XXX_SOFTIRQ enumeration list.

The tasklet_struct data structure is defined in include/linux/interrupt.h as follows:

struct tasklet_struct
{
    struct tasklet_struct *next;
    unsigned long state;
    atomic_t count;
    void (*func)(unsigned long);
    unsigned long data;
};

The following is the field-by-field description:

struct tasklet_struct *next

A pointer used to link together the pending structures associated with the same CPU. New elements are added at the head by the functions tasklet_hi_schedule and tasklet_schedule.

unsigned long state

A bitmap flag whose possible values are represented by the TASKLET_STATE_XXX enums listed in include/linux/interrupt.h:

TASKLET_STATE_SCHED

The tasklet has been scheduled for execution, and the data structure is already in the list associated with HI_SOFTIRQ or TASKLET_SOFTIRQ, based on the priority assigned. The same tasklet cannot be scheduled concurrently on different CPUs. If other requests to execute the tasklet arrive when the first one has not started its execution yet, they will be dropped. Since for any given tasklet, there can be only one instance in execution, there is no reason to schedule it for execution more than once.

TASKLET_STATE_RUN

The tasklet is being executed. This flag is used to prevent multiple instances of the same tasklet from being executed concurrently. It is meaningful only for SMP systems. The flag is manipulated with the three locking functions tasklet_trylock, tasklet_unlock, and tasklet_unlock_wait.

atomic_t count

There are cases where you may need to temporarily disable and later re-enable a tasklet. This is accomplished by this counter: a value of zero means that the tasklet is enabled (and thus executable) and a nonzero value means that it is disabled. Its value is incremented and decremented by the tasklet[_hi]_disable and tasklet[_hi]_enable functions described later in this section.

void (*func)(unsigned long)

unsigned long data

func is the function to execute and data is an optional input that can be passed to func.

The following are some important kernel functions that handle tasklets, from kernel/softirq.c and include/linux/interrupt.h:

tasklet_init

Fills in the fields of a tasklet_struct structure with the func and data values provided as arguments.

tasklet_action, tasklet_hi_action

Execute low-priority and high-priority tasklets, respectively.

tasklet_schedule, tasklet_hi_schedule

Schedule a low-priority and a high-priority tasklet, respectively, for execution. They add the tasklet_struct structure to the list of pending tasklets associated with the local CPU and then schedule the associated softirq (TASKLET_SOFTIRQ or HI_SOFTIRQ). If the tasklet is already scheduled (but not running), these APIs do nothing (see the TASKLET_STATE_SCHED flag).

tasklet_enable, tasklet_hi_enable

These two functions are identical and are used to enable a tasklet.

tasklet_disable, tasklet_disable_nosync

Both of these functions disable a tasklet and can be used with low- and high-priority tasklets. tasklet_disable is a wrapper around tasklet_disable_nosync. While the latter returns right away (it is asynchronous), the former returns only once the tasklet has terminated its execution, in case it was running (it is synchronous).

tasklet_enable, tasklet_hi_enable, and tasklet_disable_nosync manipulate the value of the count field to declare the tasklet enabled or disabled. Nested calls are allowed.

Softirq Initialization

During kernel initialization, softirq_init initializes the software IRQ layer with the two general-purpose softirqs: tasklet_action and tasklet_hi_action, which are associated with TASKLET_SOFTIRQ and HI_SOFTIRQ, respectively.

void __init softirq_init( )
{
    open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
    open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
}

The two softirqs used by the networking code, NET_RX_SOFTIRQ and NET_TX_SOFTIRQ, are initialized in net_dev_init, one of the networking initialization functions (see the section "How the Networking Code Uses softirqs").

The other softirqs listed in the section "Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq" are registered in the associated subsystems (SCSI_SOFTIRQ in drivers/scsi/scsi.c, TIMER_SOFTIRQ in kernel/timer.c, etc.).

HI_SOFTIRQ is mainly used by sound card device drivers.[*]

Users of TASKLET_SOFTIRQ include:

  • Drivers for network interface cards (not only Ethernets)

  • Numerous other device drivers

  • Media layers (USB, IEEE 1394, etc.)

  • Networking subsystems (Neighboring, ATM qdisc, etc.)

Pending softirq Handling

We explained in the section "Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq" when do_softirq is invoked to take care of the pending softirqs. Here we will see the internals of the function. You will notice how much it resembles the one used by kernel 2.2 described in the section "Bottom-half handlers in kernel 2.2."

do_softirq stops and does nothing if the CPU is currently serving a hardware or software interrupt. The function checks for this by calling in_interrupt, which is equivalent to if (in_irq( ) || in_softirq( )).

If do_softirq decides to proceed, it saves pending softirqs in pending with local_softirq_pending.

#ifndef __ARCH_HAS_DO_SOFTIRQ

asmlinkage void do_softirq(void)
{
    __u32 pending;
    unsigned long flags;

    if (in_interrupt( ))
        return;

    local_irq_save(flags);
    pending = local_softirq_pending( );
    if (pending)
        __do_softirq( );
    local_irq_restore(flags);
}

EXPORT_SYMBOL(do_softirq);
#endif

From the preceding snapshot, it could seem that do_softirq runs with IRQs disabled, but that's not true. IRQs are kept disabled only when manipulating the bitmap of pending softirqs (i.e., accessing the softnet_data structure). You will see in a moment that __do_softirq internally re-enables IRQs when running the softirq handlers.

__do_softirq function

It is possible for the same softirq type to be scheduled multiple times while do_softirq is running. Since IRQs are enabled when running the softirq handlers, the bitmap of pending softirqs can be manipulated while serving an interrupt, and therefore any of the softirq handlers that has been executed by __do_softirq could be rescheduled during the execution of __do_softirq itself.

For this reason, before __do_softirq re-enables IRQs, it saves the current bitmap of pending softirqs in the local variable pending and clears it from the softnet_data instance associated with the local CPU using local_softirq_pending( )=0. Then, based on pending, it calls all the necessary handlers.

Once all the handlers have been called, __do_softirq checks whether in the meantime any softirqs were scheduled again (this check is made with IRQs disabled). If there is at least one pending softirq, it repeats the whole process. However, __do_softirq repeats it only up to MAX_SOFTIRQ_RESTART times (experimentation has found that 10 times works well).

The use of MAX_SOFTIRQ_RESTART is a design decision made to keep a single type of interrupt—particularly a stream of networking interrupts—from starving other interrupts out of one of the CPUs. Without the limit in __do_softirq, starvation could easily happen when a server is highly loaded by network traffic and the number of NET_RX_SOFTIRQ interrupts goes through the roof.

Let's see how starvation could take place. do_IRQ would raise a NET_RX_SOFTIRQ interrupt that would cause do_softirq to be executed. __do_softirq would clear the NET_RX_SOFTIRQ flag, but before it ended it would be interrupted by an interrupt that would set NET_RX_SOFTIRQ again, and so on, indefinitely.

Let's see now how the central part of __do_softirq manages to invoke the softirq handlers. Every time one softirq type is served, its bit is cleared from the local copy of the active softirqs, pending. h is initialized to point to the global data structure softirq_vec that holds the associations between softirq types and their function handlers (for instance, NET_RX_SOFTIRQ is handled by net_rx_action). The loop ends when the bitmap is cleared.

Finally, if there are pending softirqs that cannot be handled because do_softirq must return, having repeated its job MAX_SOFTIRQ_RESTART times already, the ksoftirqd thread is awakened and given the responsibility of handling them later. Because do_softirq is invoked at so many points within the kernel, it is actually likely that a later invocation of do_softirq will handle these interrupts before the ksoftirqd thread is scheduled.

#define MAX_SOFTIRQ_RESTART 10

asmlinkage void __do_softirq(void)
{
    struct softirq_action *h;
    __u32 pending;
    int max_restart = MAX_SOFTIRQ_RESTART;
    int cpu;

    pending = local_softirq_pending();

    local_bh_disable();
    cpu = smp_processor_id();
restart:
    /* Reset the pending bitmask before enabling irqs */
    local_softirq_pending() = 0;

    local_irq_enable();

    h = softirq_vec;

    do {
        if (pending & 1) {
            h->action(h);
            rcu_bh_qsctr_inc(cpu);
        }
        h++;
        pending >>= 1;
    } while (pending);

    local_irq_disable();

    pending = local_softirq_pending();
    if (pending && --max_restart)
        goto restart;

    if (pending)
        wakeup_softirqd();

    __local_bh_enable();
}
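The bounded restart logic can be exercised outside the kernel. The sketch below is a user-space model, not kernel code: all sim_* names are invented here, the per-CPU state and interrupt disabling are omitted, and wakeup_softirqd is reduced to a flag. It shows the property the text describes: a softirq that keeps re-raising itself runs at most MAX_SOFTIRQ_RESTART times per invocation before the leftover work is deferred.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_MAX_SOFTIRQ_RESTART 10
#define SIM_NR_SOFTIRQS 6

/* User-space stand-ins: handlers[] plays the role of softirq_vec,
 * raised plays the role of local_softirq_pending(). */
typedef void (*sim_action_t)(void);
static sim_action_t handlers[SIM_NR_SOFTIRQS];
static uint32_t raised;
static int ksoftirqd_woken;  /* set where the kernel calls wakeup_softirqd() */

static void sim_raise_softirq(int nr)
{
    raised |= 1u << nr;
}

/* Mirrors the shape of __do_softirq: snapshot the bitmap, clear it,
 * walk the bits, and restart at most SIM_MAX_SOFTIRQ_RESTART times. */
static void sim_do_softirq(void)
{
    int max_restart = SIM_MAX_SOFTIRQ_RESTART;
    uint32_t pending = raised;

restart:
    raised = 0;
    for (int i = 0; pending; i++, pending >>= 1)
        if ((pending & 1) && handlers[i])
            handlers[i]();

    pending = raised;
    if (pending && --max_restart)
        goto restart;

    if (pending)
        ksoftirqd_woken = 1;  /* defer the leftovers to ksoftirqd */
}

/* A greedy handler that models a flood of NET_RX_SOFTIRQ events by
 * re-raising its own softirq every time it runs. */
static int runs;
static void greedy_handler(void)
{
    runs++;
    sim_raise_softirq(0);
}
```

Scheduling greedy_handler once and calling sim_do_softirq runs it exactly SIM_MAX_SOFTIRQ_RESTART times and then sets ksoftirqd_woken, matching the deferral path described above.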

Per-Architecture Processing of softirq

The do_softirq function provided in kernel/softirq.c can be overridden by another function provided by the architecture code (which ends up calling __do_softirq anyway). This explains why the definition of do_softirq in kernel/softirq.c is wrapped with the check on __ARCH_HAS_DO_SOFTIRQ (see the previous section).

A few architectures, including i386 (see arch/i386/kernel/irq.c), define their own version of do_softirq. Such architecture versions are used when the architectures use 4 KB stacks (instead of 8 KB) and use the remaining 4 KB to implement stacked handling of both hard IRQs and softirqs. Please refer to Understanding the Linux Kernel for more detail.

ksoftirqd Kernel Threads

Background kernel threads are assigned the job of checking for softirqs that have been left unexecuted by the functions previously described, and executing as many of those softirqs as they can before needing to give that CPU back to other activities. There is one kernel thread for each CPU, named ksoftirqd_CPU0, ksoftirqd_CPU1, and so on. The section "Starting the threads" describes how these threads are started at CPU boot time.

The function ksoftirqd associated to these threads is pretty simple and is defined in the same file softirq.c:

static int ksoftirqd(void * __bind_cpu)
{
    set_user_nice(current, 19);
    ...
    while (!kthread_should_stop()) {
        if (!local_softirq_pending())
            schedule();

        __set_current_state(TASK_RUNNING);

        while (local_softirq_pending()) {
            /* Preempt disable stops cpu going offline.
               If already offline, we'll be on wrong CPU:
               don't process */
            preempt_disable();
            if (cpu_is_offline((long)__bind_cpu))
                goto wait_to_die;
            do_softirq();
            preempt_enable();
            cond_resched();
        }
        set_current_state(TASK_INTERRUPTIBLE);
    }
    __set_current_state(TASK_RUNNING);
    return 0;
    ...
}

There are a couple of small details I want to emphasize. The priority of a process, also called the nice priority, is a number ranging from -20 (maximum) to +19 (minimum). The ksoftirqd threads are given a low priority of 19. This is done so that frequently running softirqs such as NET_RX_SOFTIRQ cannot completely kidnap the CPU, which would leave almost no resources to other processes. We already saw that do_softirq can be invoked from different places in the code, so this low priority doesn't represent a handicap. Once started, the loop simply keeps calling do_softirq (always with preemption disabled) until one of the following conditions becomes true:

  • There are no more pending softirqs to handle (local_softirq_pending( ) returns FALSE).

    In this case, the function sets the thread's state to TASK_INTERRUPTIBLE and calls schedule( ) to release the CPU. The thread can be awakened by means of wakeup_softirqd, which can be called from both __do_softirq itself and raise_softirq_irqoff.

  • The thread has run for too long and is asked to release the CPU.

    The handler associated with the timer interrupt, among other things, sets the need_resched flag to signal that the current process/thread has used its time slot. In this case, ksoftirqd releases the CPU, keeping its state as TASK_RUNNING, and will soon be resumed.

Starting the threads

There is one ksoftirqd thread for each CPU. When the system's first CPU comes online, the first thread is started at kernel boot time inside do_pre_smp_initcalls. [*] The ksoftirqd threads for the other CPUs that come up at boot time, and for any other CPU that may be enabled later on a system that can handle hot-pluggable CPUs, are taken care of through the cpu_chain notification chain.

Notification chains were introduced in Chapter 4. The cpu_chain chain lets various subsystems know when a CPU is up and running or when one dies. The softirq subsystem registers to the cpu_chain with spawn_ksoftirqd, called from the function do_pre_smp_initcalls mentioned previously. The callback routine cpu_callback that processes notifications from cpu_chain is used to initialize the necessary per-CPU data structures and start the ksoftirqd thread on the CPU.

The complete list of CPU_XXX notifications is in include/linux/notifier.h, but we need only four of them in the context of this chapter:

CPU_UP_PREPARE

Generated when the CPU starts coming up, but is not ready yet.

CPU_ONLINE

Generated when the CPU is ready.

CPU_UP_CANCELLED

CPU_DEAD

These two messages are generated only when the kernel is compiled with support for hot-pluggable CPUs. The first is used when one of the tasks triggered by a previous CPU_UP_PREPARE notification failed and therefore does not allow the CPU to go online. The second one is used when a CPU dies.

CPU_UP_PREPARE creates the thread and binds it to the associated CPU, but does not wake up the thread. CPU_ONLINE wakes up the thread. When a CPU dies, its associated ksoftirqd instance is killed:

static int __devinit cpu_callback(struct notifier_block *nfb, unsigned long action,
                                  void *hcpu)
{
    ...
    switch (action) {
        ...
    }
    return NOTIFY_OK;
}

static struct notifier_block __devinitdata cpu_nfb = {
    .notifier_call = cpu_callback
};

__init int spawn_ksoftirqd(void)
{
    void *cpu = (void *)(long)smp_processor_id();
    cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
    cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
    register_cpu_notifier(&cpu_nfb);
    return 0;
}

Note that spawn_ksoftirqd places two direct calls to cpu_callback before registering with cpu_chain via register_cpu_notifier. This is necessary because CPU notifications are not generated for the first CPU that comes online.
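The bootstrapping detail above lends itself to a small model. The following user-space sketch is illustrative only (the sim_* names are invented and the real notifier API carries priorities and return codes): a chain only delivers events raised after registration, so the boot CPU's two events have to be replayed by hand, which is exactly what spawn_ksoftirqd does.

```c
#include <assert.h>
#include <stddef.h>

enum { SIM_CPU_UP_PREPARE, SIM_CPU_ONLINE };

struct sim_notifier_block {
    int (*notifier_call)(struct sim_notifier_block *, unsigned long, void *);
    struct sim_notifier_block *next;
};

static struct sim_notifier_block *sim_cpu_chain;

/* Register a callback; only events raised afterward reach it. */
static void sim_register_cpu_notifier(struct sim_notifier_block *nb)
{
    nb->next = sim_cpu_chain;
    sim_cpu_chain = nb;
}

/* Walk the chain, like a notifier_call_chain on cpu_chain. */
static void sim_cpu_notify(unsigned long action, void *hcpu)
{
    for (struct sim_notifier_block *nb = sim_cpu_chain; nb; nb = nb->next)
        nb->notifier_call(nb, action, hcpu);
}

static int thread_created[2], thread_running[2];

static int sim_cpu_callback(struct sim_notifier_block *nfb,
                            unsigned long action, void *hcpu)
{
    long cpu = (long)hcpu;

    if (action == SIM_CPU_UP_PREPARE)
        thread_created[cpu] = 1;  /* create + bind the per-CPU thread */
    else if (action == SIM_CPU_ONLINE)
        thread_running[cpu] = 1;  /* wake it up */
    return 0;
}

static struct sim_notifier_block sim_cpu_nfb = {
    .notifier_call = sim_cpu_callback
};

/* Mirrors the shape of spawn_ksoftirqd: the boot CPU's events already
 * happened, so they are replayed by hand before registering. */
static void sim_spawn_ksoftirqd(void)
{
    void *cpu0 = (void *)0L;
    sim_cpu_callback(&sim_cpu_nfb, SIM_CPU_UP_PREPARE, cpu0);
    sim_cpu_callback(&sim_cpu_nfb, SIM_CPU_ONLINE, cpu0);
    sim_register_cpu_notifier(&sim_cpu_nfb);
}
```

Without the two direct calls, thread_created[0] would stay 0: CPU 0 came up before anyone was listening on the chain.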

Tasklet Processing

The two handlers for normal tasklets (TASKLET_SOFTIRQ) and high-priority tasklets (HI_SOFTIRQ) are identical; they simply work on two different lists. For this reason, we will describe only one of them: tasklet_action, the one associated with TASKLET_SOFTIRQ.

Only one instance of each tasklet can be waiting for execution at any time. When tasklet_schedule or tasklet_hi_schedule schedules a tasklet, the function sets the TASKLET_STATE_SCHED bit described earlier in the section "Tasklets." Attempts to reschedule the same tasklet will be ignored because TASKLET_STATE_SCHED is already set. The bit is cleared only when the tasklet starts its execution; thus, during or after its execution another instance can be scheduled.
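The single-pending-instance rule can be restated as a few lines of C. This is a user-space illustration with invented sim_* names; the kernel uses an atomic test_and_set_bit on TASKLET_STATE_SCHED, which the plain bit operations below only approximate in single-threaded code.

```c
#include <assert.h>

#define SIM_STATE_SCHED 0x1

struct sim_tasklet {
    unsigned state;
    int times_queued;   /* how often it entered the pending list */
    int times_run;
};

/* Like tasklet_schedule: queue the tasklet only if SCHED was not
 * already set; a second request while pending is ignored. */
static void sim_tasklet_schedule(struct sim_tasklet *t)
{
    if (t->state & SIM_STATE_SCHED)
        return;
    t->state |= SIM_STATE_SCHED;
    t->times_queued++;
}

/* Like the core of tasklet_action: clear SCHED before running the
 * handler, so a new instance may be scheduled during execution. */
static void sim_tasklet_run(struct sim_tasklet *t)
{
    t->state &= ~SIM_STATE_SCHED;
    t->times_run++;
}
```

The second sim_tasklet_schedule call is a no-op, but once the run clears the bit the tasklet can be queued again, even from within its own handler.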

The tasklet_action function starts by copying the list of tasklets waiting to be processed into a local variable first; it then clears the global list.[*] This is the only part of the function that is executed with interrupts disabled. Disabling them is necessary to avoid race conditions with interrupt handlers that could add new elements to the list while tasklet_action accesses it.

At this point, the function goes through the list tasklet by tasklet. For each element it invokes the handler if both of the following are true:

  • The tasklet is not already running—in other words, TASKLET_STATE_RUN is clear. (The function runs tasklet_trylock to see whether TASKLET_STATE_RUN is already set; if not, tasklet_trylock sets the bit.)

  • The tasklet is enabled (count is zero).

The part of the function implementing these activities follows:

    struct tasklet_struct *list;

    local_irq_disable();
    list = __get_cpu_var(tasklet_vec).list;
    __get_cpu_var(tasklet_vec).list = NULL;
    local_irq_enable();

    while (list) {
        struct tasklet_struct *t = list;

        list = list->next;

        if (tasklet_trylock(t)) {
            if (!atomic_read(&t->count)) {

At this stage, since the tasklet was not already being executed and it was extracted from the list of pending tasklets, it must have the TASKLET_STATE_SCHED flag set:

                if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
                    BUG();
                t->func(t->data);
                tasklet_unlock(t);
                continue;
            }
            tasklet_unlock(t);
        }

If the handler cannot be executed, the tasklet is put back into the list and TASKLET_SOFTIRQ is rescheduled to take care of all of those tasklets that for one of the two reasons listed earlier cannot be handled now:

        local_irq_disable();
        t->next = __get_cpu_var(tasklet_vec).list;
        __get_cpu_var(tasklet_vec).list = t;
        __raise_softirq_irqoff(TASKLET_SOFTIRQ);
        local_irq_enable();
    }
}

How the Networking Code Uses softirqs

The networking subsystem has been assigned two different softirqs. NET_RX_SOFTIRQ handles incoming traffic and NET_TX_SOFTIRQ handles outgoing traffic. Both are registered in net_dev_init (described in Chapter 5) through the following lines:

open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);
open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);

Because different instances of the same softirq handler can run concurrently on different CPUs (unlike tasklets), networking code is both low latency and scalable.

Both networking softirqs are higher in priority than normal tasklets (TASKLET_SOFTIRQ) but are lower in priority than high-priority tasklets (HI_SOFTIRQ). This prioritization guarantees that other high-priority tasks can proceed in a responsive and timely manner even when a system is under a high network load.
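This ordering falls out of how the softirq types are numbered: __do_softirq scans the pending bitmap from bit 0 upward, so a lower value means earlier service. The enum below restates the declaration order found in 2.6-era kernels; check include/linux/interrupt.h in your own tree, since the set of softirqs has changed across releases.

```c
#include <assert.h>

/* Declaration order of softirq types in 2.6-era include/linux/interrupt.h;
 * a lower value means the softirq is served earlier by __do_softirq. */
enum {
    HI_SOFTIRQ = 0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    SCSI_SOFTIRQ,
    TASKLET_SOFTIRQ
};
```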

The internals of the two handlers are covered in the sections "Processing the NET_RX_SOFTIRQ: net_rx_action" in Chapter 10 and "Processing the NET_TX_SOFTIRQ: net_tx_action" in Chapter 11.

softnet_data Structure

We will see in Chapter 10 that each CPU has its own queue for incoming frames. Because each CPU has its own data structure to manage ingress and egress traffic, there is no need for any locking among different CPUs. The data structure for this queue, softnet_data, is defined in include/linux/netdevice.h as follows:

struct softnet_data
{
    int            throttle;
    int            cng_level;
    int            avg_blog;
    struct sk_buff_head    input_pkt_queue;
    struct list_head       poll_list;
    struct net_device      *output_queue;
    struct sk_buff         *completion_queue;
    struct net_device      backlog_dev;
}

The structure includes both fields used for reception and fields used for transmission. In other words, both the NET_RX_SOFTIRQ and NET_TX_SOFTIRQ softirqs refer to the structure. Ingress frames are queued to input_pkt_queue,[*] and egress frames are placed into the specialized queues handled by Traffic Control (the QoS layer) instead of being handled by softirqs and the softnet_data structure, but softirqs are still used to clean up transmitted buffers afterward, to keep that task from slowing transmission.

Fields of softnet_data

The following is a brief field-by-field description of this data structure; details will be given in later chapters. Some drivers use the NAPI interface, whereas others have not yet been updated to NAPI; both types of driver use this structure, but some fields are reserved for the non-NAPI drivers.

throttle

avg_blog

cng_level

These three parameters are used by the congestion management algorithm and are further described following this list, as well as in the "Congestion Management" section in Chapter 10. All three, by default, are updated with the reception of every frame.

input_pkt_queue

This queue, initialized in net_dev_init, is where incoming frames are stored before being processed by the driver. It is used by non-NAPI drivers; those that have been upgraded to NAPI use their own private queues.

backlog_dev

This is an entire embedded data structure (not just a pointer to one) of type net_device, which represents a device that has scheduled net_rx_action for execution on the associated CPU. This field is used by non-NAPI drivers. The name stands for "backlog device." You will see how it is used in the section "Old Interface Between Device Drivers and Kernel: First Part of netif_rx" in Chapter 10.

poll_list

This is a bidirectional list of devices with input frames waiting to be processed. More details can be found in the section "Processing the NET_RX_SOFTIRQ: net_rx_action" in Chapter 10.

output_queue

completion_queue

output_queue is the list of devices that have something to transmit, and completion_queue is the list of buffers that have been successfully transmitted and therefore can be released. More details are given in the section "Processing the NET_TX_SOFTIRQ: net_tx_action" in Chapter 11.

throttle is treated as a Boolean variable whose value is true when the CPU is overloaded and false otherwise. Its value depends on the number of frames in input_pkt_queue. When the throttle flag is set, all input frames received by this CPU are dropped, regardless of the number of frames in the queue.[*]

avg_blog represents the weighted average value of the input_pkt_queue queue length; it can range from 0 to the maximum length represented by netdev_max_backlog. avg_blog is used to compute cng_level.

cng_level, which represents the congestion level, can take any of the values shown in Figure 9-4. As avg_blog hits one of the thresholds shown in the figure, cng_level changes value. The definitions of the NET_RX_XXX enum values are in include/linux/netdevice.h, and the definitions of the congestion levels mod_cong, lo_cong, and no_cong are in net/core/dev.c.[] The strings within brackets (/DROP and /HIGH) are explained in the section "Congestion Management" in Chapter 10. avg_blog and cng_level are recalculated with each frame, by default, but recalculation can be postponed and tied to a timer to avoid adding too much overhead.

Figure 9-4. Congestion level (NET_RX_XXX) based on the average backlog avg_blog

avg_blogcng_level与 CPU 关联,因此适用于非 NAPI 设备,这些设备共享input_pkt_queue每个 CPU 使用的队列。

avg_blog and cng_level are associated with the CPU and therefore apply to non-NAPI devices, which share the queue input_pkt_queue that is used by each CPU.
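The relationship between avg_blog and cng_level can be sketched as a running average plus a threshold comparison. The cut-off values below (20, 100, 290) are the no_cong, lo_cong, and mod_cong defaults from 2.6-era net/core/dev.c; the 50/50 averaging weight and all sim_* names are illustrative only, not the kernel's exact formula.

```c
#include <assert.h>

/* Congestion thresholds as defined in 2.6-era net/core/dev.c. */
enum { SIM_NO_CONG = 20, SIM_LO_CONG = 100, SIM_MOD_CONG = 290 };

/* Stand-ins for the NET_RX_XXX congestion levels of Figure 9-4. */
enum { SIM_CN_NONE, SIM_CN_LOW, SIM_CN_MOD, SIM_CN_HIGH };

/* Map an average backlog to a congestion level. */
static int sim_cng_level(int avg_blog)
{
    if (avg_blog <= SIM_NO_CONG)
        return SIM_CN_NONE;
    if (avg_blog <= SIM_LO_CONG)
        return SIM_CN_LOW;
    if (avg_blog <= SIM_MOD_CONG)
        return SIM_CN_MOD;
    return SIM_CN_HIGH;
}

/* Simple running average of input_pkt_queue's length; the 50/50
 * weighting is illustrative, not the kernel's exact coefficients. */
static int sim_update_avg_blog(int avg_blog, int qlen)
{
    return (avg_blog + qlen) / 2;
}
```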

Initialization of softnet_data

Each CPU's softnet_data structure is initialized by net_dev_init, which runs at boot time and is described in Chapter 5. The initialization code is:

    for (i = 0; i < NR_CPUS; i++) {
        struct softnet_data *queue;

        queue = &per_cpu(softnet_data, i);
        skb_queue_head_init(&queue->input_pkt_queue);
        queue->throttle = 0;
        queue->cng_level = 0;
        queue->avg_blog = 10; /* arbitrary non-zero */
        queue->completion_queue = NULL;
        INIT_LIST_HEAD(&queue->poll_list);
        set_bit(__LINK_STATE_START, &queue->backlog_dev.state);
        queue->backlog_dev.weight = weight_p;
        queue->backlog_dev.poll = process_backlog;
        atomic_set(&queue->backlog_dev.refcnt, 1);
    }

NR_CPUS is the maximum number of CPUs the Linux kernel can handle and softnet_data is a vector of struct softnet_data structures.

The code also initializes the fields of softnet_data->backlog_dev, a structure of type net_device, a special device representing non-NAPI devices. The section "Backlog Processing: The process_backlog Poll Virtual Function" in Chapter 10 describes how non-NAPI device drivers are handled transparently with the old netif_rx interface.




[*] Frames can be dropped for a variety of reasons: no memory in the input queue, no memory in the output queue (only for forwarded or transmitted frames), no route to destination, firewall policy, a failed sanity check, etc.

[*] This discussion applies mainly to Ethernet devices, which already do not guarantee an upper bound on the transmission time (and therefore on the reception) because of the congestion algorithm they use.

[*] This is not a trivial driver. Going through the other three chapters of this part of the book first is advisable.

[*] vortex_rx is passed the device as an input parameter because a device driver can handle more instances of the same device type or family. Therefore, when it is invoked it needs to know which device it is dealing with.

[*] We saw in Chapter 5 that an interrupt handler that is declared as a slow handler is executed with the interrupts enabled on the local CPU.

[*] The PREEMPT_ACTIVE flag is defined on a per-architecture basis. The figure shows the most common definition.

[*] It has been removed in 2.4.6.

[*] It is also possible to call invoke_softirq instead of do_softirq directly. The former could be an alias to do_softirq or to its helper routine, __do_softirq, depending on whether the __ARCH_IRQ_EXIT_IRQS_DISABLED symbol is defined.

[*] The kernel provides work queues as well. We will not cover them because they are not used much by the networking code. Refer to Understanding the Linux Kernel for a discussion of work queues.

[*] In 2.4 kernels, all the bottom-half handlers of kernel version 2.2 were converted to high-priority tasklets by defining the mark_bh function as a wrapper around tasklet_hi_schedule.

[*] See Chapter 7 for details about how the kernel takes care of basic initializations at boot time.

[*] We will see that one of the networking softirq handlers (net_tx_action) does something similar.

[*] You will see in Chapter 10 that this is no longer true for drivers using NAPI.

[*] Drivers using NAPI might not drop incoming traffic under these conditions.

[] The NET_RX_XXX values are also used outside this context, and there are other NET_RX_XXX values not used here. The value no_cong_thresh is not used; it used to be used by process_backlog (described in Chapter 10) to remove a queue from the throttle state under some conditions when the kernel still had support for the feature (which has been dropped).

Chapter 10. Frame Reception

In the previous chapter, we saw that the functions that deal with frames at the L2 layer are driven by interrupts. In this chapter, we start our discussion about frame reception, where the hardware uses an interrupt to signal the CPU about the availability of the frame.

As shown in Figure 9-2 in Chapter 9, the CPU that receives an interrupt executes the do_IRQ function. The IRQ number causes the right handler to be invoked. The handler is typically a function within the device driver registered at device driver initialization time. IRQ function handlers are executed in interrupt mode, with further interrupts temporarily disabled.

As discussed in the section "Interrupt Handlers" in Chapter 9, the interrupt handler performs a few immediate tasks and schedules others in a bottom half to be executed later. Specifically, the interrupt handler:

  1. Copies the frame into an sk_buff data structure.[*]

  2. Initializes some of the sk_buff parameters for use later by upper network layers (notably skb->protocol, which identifies the higher-layer protocol handler and will play a major role in Chapter 13).

  3. Updates some other parameters private to the device, which we do not consider in this chapter because they do not influence the frame's path inside the network stack.

  4. Signals the kernel about the new frame by scheduling the NET_RX_SOFTIRQ softirq for execution.

Since a device can issue an interrupt for different reasons (new frame received, frame transmission successfully completed, etc.), the kernel is given a code along with the interrupt notification so that the device driver handler can process the interrupt based on the type.

Interactions with Other Features

While perusing the routines introduced in this chapter, you will often see pieces of code for interacting with optional kernel features. For features covered in this book, I will refer you to the chapter on that feature; for other features, I will not spend much time on the code. Most of the flowcharts in the chapter also show where those optional features are handled in the routines.

Here are the optional features we'll see, with the associated kernel symbols:

802.1d Ethernet Bridging (CONFIG_BRIDGE/CONFIG_BRIDGE_MODULE)

Bridging is described in Part IV.

Netpoll (CONFIG_NETPOLL)

Netpoll is a generic framework for sending and receiving frames by polling the network interface cards (NICs), eliminating the need for interrupts. Netpoll can be used by any kernel feature that benefits from its functionality; one prominent example is Netconsole, which logs kernel messages (i.e., strings printed with printk) to a remote host via UDP. Netconsole and its suboptions can be turned on from the make xconfig menu with the "Networking support → Network console logging support" option. To use Netpoll, devices must include support for it (which quite a few already do).

Packet Action (CONFIG_NET_CLS_ACT)

With this feature, Traffic Control can classify and apply actions to ingress traffic. Possible actions include dropping the packet and consuming the packet. To see this option and all its suboptions from the make xconfig menu, you need first to select the "Networking support → Networking options → QoS and/or fair queueing → Packet classifier API" option.

Enabling and Disabling a Device

A device can be considered enabled when the _ _LINK_STATE_START flag is set in net_device->state. The section "Enabling and Disabling a Device" in Chapter 8 covers the details of this flag. The flag is normally set when the device is open (dev_open) and cleared when the device is closed (dev_close). While there is a flag that is used to explicitly enable and disable transmission for a device (_ _LINK_STATE_XOFF), there is none to enable and disable reception. That capability is achieved by other means—i.e., by disabling the device, as described in Chapter 8. The status of the _ _LINK_STATE_START flag can be checked with the netif_running function.

Several functions shown later in this chapter provide simple wrappers that check the correct status of flags such as _ _LINK_STATE_START to make sure the device is ready to do what is about to be asked of it.

Queues

When discussing L2 behavior, I often talk about queues for frames being received (ingress queues ) and transmitted (egress queues ). Each queue has a pointer to the devices associated with it, and to the skb_buff data structures that store the ingress/egress buffers. Only a few specialized devices work without queues; an example is the loopback device. The loopback device can dispense with queues because when you transmit a packet out of the loopback device, the packet is immediately delivered (to the local system) with no need for intermediate queuing. Moreover, since transmissions on the loopback device cannot fail, there is no need to requeue the packet for another transmission attempt.

Egress queues are associated directly to devices; Traffic Control (the Quality of Service, or QoS, layer) defines one queue for each device. As we will see in Chapter 11, the kernel keeps track of devices waiting to transmit frames, not the frames themselves. We will also see that not all devices actually use Traffic Control. The situation with ingress queues is a bit more complicated, as we'll see later.

Notifying the Kernel of Frame Reception: NAPI and netif_rx

In version 2.5 (then backported to a late revision of 2.4 as well), a new API for handling ingress frames was introduced into the Linux kernel, known (for lack of a better name) as NAPI. Since few devices have been upgraded to NAPI, there are two ways a Linux driver can notify the kernel about a new frame:

By means of the old function netif_rx

This is the approach used by those devices that follow the technique described in the section "Processing Multiple Frames During an Interrupt" in Chapter 9. Most Linux device drivers still use this approach.

By means of the NAPI mechanism

This is the approach used by those devices that follow the technique described in the variation introduced at the end of the section "Processing Multiple Frames During an Interrupt" in Chapter 9. This is new in the Linux kernel, and only a few drivers use it. drivers/net/tg3.c was the first one to be converted to NAPI.

A few device drivers allow you to choose between the two types of interfaces when you configure the kernel options with tools such as make xconfig.

The following piece of code comes from vortex_rx, which still uses the old function netif_rx, and you can expect most of the network device drivers not yet using NAPI to do something similar:

    skb = dev_alloc_skb(pkt_len + 5);
        ... ... ...
    if (skb != NULL) {
        skb->dev = dev;
        skb_reserve(skb, 2);    /* Align IP on 16 byte boundaries */
                ... ... ...
                /* copy the DATA into the sk_buff structure */
                ... ... ...
        skb->protocol = eth_type_trans(skb, dev);
        netif_rx(skb);
        dev->last_rx = jiffies;
            ... ... ...
    }

First, the sk_buff data structure is allocated with dev_alloc_skb (see Chapter 2), and the frame is copied into it. Note that before copying, the code reserves two bytes to align the IP header to a 16-byte boundary. Each network device driver is associated with a given interface type; for instance, the Vortex device driver drivers/net/3c59x.c is associated with a specific family of Ethernet cards. Therefore, the driver knows the length of the link layer's header and how to interpret it. Given a header length of 16*k+n, the driver can force an alignment to a 16-byte boundary by simply calling skb_reserve with an offset of 16−n. An Ethernet header is 14 bytes, so k=0, n=14, and the offset requested by the code is 2 (see the definition of NET_IP_ALIGN and the associated comment in include/linux/skbuff.h).

Note also that at this stage, the driver does not make any distinction between different L3 protocols. It aligns the L3 header to a 16-byte boundary regardless of the type. The L3 protocol is probably IP because of IP's widespread usage, but that is not guaranteed at this point; it could be Netware's IPX or something else. The alignment is useful regardless of the L3 protocol to be used.

eth_type_trans, which is used to extract the protocol identifier skb->protocol, is described in Chapter 13.[*]

Depending on the complexity of the driver's design, the block shown may be followed by other housekeeping tasks, but we are not interested in those details in this book. The most important part of the function is the notification to the kernel about the frame's reception.

Introduction to the New API (NAPI)

Even though some of the NIC device drivers have not been converted to NAPI yet, the new infrastructure has been integrated into the kernel, and even the interface between netif_rx and the rest of the kernel has to take NAPI into account. Instead of introducing the old approach (pure netif_rx) first and then talking about NAPI, we will first see NAPI and then show how the old drivers keep their old interface (netif_rx) while sharing some of the new infrastructure mechanisms.

NAPI mixes interrupts with polling and gives higher performance under high traffic load than the old approach, by reducing significantly the load on the CPU. The kernel developers backported that infrastructure to the 2.4 kernels.

In the old model, a device driver generates an interrupt for each frame it receives. Under a high traffic load, the time spent handling interrupts can lead to a considerable waste of resources.

The main idea behind NAPI is simple: instead of using a pure interrupt-driven model, it uses a mix of interrupts and polling. If new frames are received when the kernel has not finished handling the previous ones yet, there is no need for the driver to generate other interrupts: it is just easier to have the kernel keep processing whatever is in the device input queue (with interrupts disabled for the device), and re-enable interrupts once the queue is empty. This way, the driver reaps the advantages of both interrupts and polling:

  • Asynchronous events, such as the reception of one or more frames, are indicated by interrupts so that the kernel does not have to check continuously if the device's ingress queue is empty.

  • If the kernel knows there is something left in the device's ingress queue, there is no need to waste time handling interrupt notifications. A simple polling is enough.

From the kernel processing point of view, here are some of the advantages of the NAPI approach:

Reduced load on the CPU (because there are fewer interrupts)

Given the same workload (i.e., number of frames per second), the load on the CPU is lower with NAPI. This is especially true at high workloads. At low workloads, you may actually have slightly higher CPU usage with NAPI, according to tests posted by the kernel developers on the kernel mailing list.

More fairness in the handling of devices

We will see later how devices that have something in their ingress queues are accessed fairly in a round-robin fashion. This ensures that devices with low traffic can experience acceptable latencies even when other devices are much more loaded.

net_device Fields Used by NAPI

Before looking at NAPI's implementation and use, I need to describe a few fields of the net_device data structure, mentioned in the section "softnet_data Structure" in Chapter 9.

Four new fields have been added to this structure for use by the NET_RX_SOFTIRQ softirq when dealing with devices whose drivers use the NAPI interface. The other devices will not use them, but they will share the fields of the net_device structure embedded in the softnet_data structure as its backlog_dev field.

poll

A virtual function used to dequeue buffers from the device's ingress queue. The queue is a private one for devices using NAPI, and softnet_data->input_pkt_queue for others. See the section "Backlog Processing: The process_backlog Poll Virtual Function."

poll_list

List of devices that have new frames in the ingress queue waiting to be processed. These devices are known as being in polling state. The head of the list is softnet_data->poll_list. Devices in this list have interrupts disabled and the kernel is currently polling them.

quota

weight

quota is an integer that represents the maximum number of buffers that can be dequeued by the poll virtual function in one shot. Its value is incremented in units of weight and it is used to enforce some sort of fairness among different devices. Lower quotas mean lower potential latencies and therefore a lower risk of starving other devices. On the other hand, a low quota increases the amount of switching among devices, and therefore overall overhead.

For devices associated with non-NAPI drivers, the default value of weight is 64, stored in weight_p at the top of net/core/dev.c. The value of weight_p can be changed via /proc.

For devices associated with NAPI drivers, the default value is chosen by the drivers. The most common value is 64, but 16 and 32 are used, too. Its value can be tuned via sysfs.

For both the /proc and sysfs interfaces, see the section "Tuning via /proc and sysfs Filesystems" in Chapter 12.

The section "Old Versus New Driver Interfaces" describes how and when elements are added to poll_list, and the section "Backlog Processing: The process_backlog Poll Virtual Function" describes when the poll method extracts elements from the list and how quota is updated based on the value of weight.

Devices using NAPI initialize these four fields and other net_device fields according to the initialization model described in Chapter 8. For the fake backlog_dev devices, introduced in the section "Initialization of softnet_data" in Chapter 9 and described later in this chapter, the initialization is taken care of by net_dev_init (described in Chapter 5).

net_rx_action and NAPI

Figure 10-1 shows what happens each time the kernel polls for incoming network traffic. In the figure, you can see the relationships among the poll_list list of devices in polling state, the poll virtual function, and the software interrupt handler net_rx_action. The following sections will go into detail on each aspect of that diagram, but it is important to understand how the parts interact before moving to the source code.

Figure 10-1. net_rx_action function and NAPI overview

We already know that net_rx_action is the function associated with the NET_RX_SOFTIRQ flag. For the sake of simplicity, let's suppose that after a period of very low activity, a few devices start receiving frames and that these somehow trigger the execution of net_rx_action—how they do so is not important for now.

net_rx_action browses the list of devices in polling state and calls the associated poll virtual function for each device to process the frames in the ingress queue. I explained earlier that devices in that list are consulted in a round-robin fashion, and that there is a maximum number of frames they can process each time their poll method is invoked. If they cannot clear the queue during their slot, they have to wait for their next slot to continue. This means that net_rx_action keeps calling the poll method provided by the device driver for a device with something in its ingress queue until the latter empties out. At that point, there is no need anymore for polling, and the device driver can re-enable interrupt notifications for the device. It is important to underline that interrupts are disabled only for those devices in poll_list, which applies only to devices that use NAPI and do not share backlog_dev.

net_rx_action limits its execution time and reschedules itself for execution when it passes a given limit of execution time or processed frames; this is enforced to make net_rx_action behave fairly in relation to other kernel tasks. At the same time, each device limits the number of frames processed by each invocation of its poll method to be fair in relation to other devices. When a device cannot clear out its ingress queue, it has to wait until the next call of its poll method.

Old Versus New Driver Interfaces

Now that the meaning of the NAPI-related fields of the net_device structure, and the high-level idea behind NAPI, should be clear, we can get closer to the source code.

Figure 10-2 shows the difference between a NAPI-aware driver and the others with regard to how the driver tells the kernel about the reception of new frames.

From the device driver perspective, there are only two differences between NAPI and non-NAPI. The first is that NAPI drivers must provide a poll method, described in the section "net_device fields used by NAPI." The second difference is the function called to schedule a frame: non-NAPI drivers call netif_rx, whereas NAPI drivers call _ _netif_rx_schedule, defined in include/linux/netdevice.h. (The kernel provides a wrapper function named netif_rx_schedule, which checks to make sure that the device is running and that the softirq is not already scheduled, and then it calls _ _netif_rx_schedule. These checks are done with netif_rx_schedule_prep. Some drivers call netif_rx_schedule, and others call netif_rx_schedule_prep explicitly and then _ _netif_rx_schedule if needed).

As shown in Figure 10-2, both types of drivers queue the input device to a polling list (poll_list), schedule the NET_RX_SOFTIRQ software interrupt for execution, and therefore end up being handled by net_rx_action. Even though both types of drivers ultimately call _ _netif_rx_schedule (non-NAPI drivers do so within netif_rx), the NAPI devices offer potentially much better performance for the reasons we saw in the section "Notifying Drivers When Frames Are Received" in Chapter 9.

Figure 10-2. NAPI-aware drivers versus non-NAPI-aware devices

An important detail in Figure 10-2 is the net_device structure that is passed to _ _netif_rx_schedule in the two cases. Non-NAPI devices use the one that is built into the CPU's softnet_data structure, and NAPI devices use net_device structures that refer to themselves.

Manipulating poll_list

We saw in the previous section that any device (including the fake one, backlog_dev) is added to the poll_list list with a call to netif_rx_schedule or _ _netif_rx_schedule.

The reverse operation, removing a device from the list, is done with netif_rx_complete or _ _netif_rx_complete (the second one assumes interrupts are already disabled on the local CPU). We will see when these two routines are called in the section "Processing the NET_RX_SOFTIRQ: net_rx_action."

A device can also temporarily disable and re-enable polling with netif_poll_disable and netif_poll_enable, respectively. This does not mean that the device driver has decided to revert to an interrupt-based model. Polling might be disabled on a device, for instance, when the device needs to be reset by the device driver to apply some kind of hardware configuration changes.

I already said that netif_rx_schedule filters requests for devices that are already in the poll_list (i.e., that have the _ _LINK_STATE_RX_SCHED flag set). For this reason, if a driver sets that flag but does not add the device to poll_list, it basically disables polling for the device: the device will never be added to poll_list. This is how netif_poll_disable works: if _ _LINK_STATE_RX_SCHED was not set, it simply sets it and returns. Otherwise, it waits for it to be cleared and then sets it.

static inline void netif_poll_disable(struct net_device *dev)
{
    while (test_and_set_bit(_ _LINK_STATE_RX_SCHED, &dev->state)) {
        /* No hurry. */
        current->state = TASK_INTERRUPTIBLE;
        schedule_timeout(1);
    }
}

Old Interface Between Device Drivers and Kernel: First Part of netif_rx

The netif_rx function, defined in net/core/dev.c, is normally called by device drivers when new input frames are waiting to be processed;[*] its job is to schedule the softirq that runs shortly afterward to dequeue and handle the frames. Figure 10-3 shows what it checks for and the flow of its events. The figure is practically longer than the code, but it is useful to help understand how netif_rx reacts to its context.

netif_rx is usually called by a driver while in interrupt context, but there are exceptions, notably when the function is called by the loopback device. For this reason, netif_rx disables interrupts on the local CPU when it starts, and re-enables them when it finishes.

When looking at the code, one should keep in mind that different CPUs can run netif_rx concurrently. This is not a problem, since each CPU is associated with a private softnet_data structure that maintains state information. Among other things, the CPU's softnet_data structure includes a private input queue (see the section "softnet_data Structure" in Chapter 9).

Figure 10-3. netif_rx function

This is the function's prototype:

int netif_rx(struct sk_buff *skb)

Its only input parameter is the buffer received by the device, and the output value is an indication of the congestion level (you can find details in the section "Congestion Management").

The main tasks of netif_rx, whose detailed flowchart is depicted in Figure 10-3, include:

  • Initializing some of the sk_buff data structure fields (such as the time the frame was received).

  • Storing the received frame onto the CPU's private input queue and notifying the kernel about the frame by triggering the associated softirq NET_RX_SOFTIRQ. This step takes place only if certain conditions are met, the most important of which is whether there is space in the queue.

  • Updating the statistics about the congestion level.

Figure 10-4 shows an example of a system with a bunch of CPUs and devices. Each CPU has its own instance of softnet_data, which includes the private input queue where netif_rx will store ingress frames, and the completion_queue where buffers are sent when they are not needed anymore (see the section "Processing the NET_TX_SOFTIRQ: net_tx_action" in Chapter 11). The figure shows an example where CPU 1 receives an RxComplete interrupt from eth0. The associated driver stores the ingress frame into CPU 1's queue. CPU m receives a DMADone interrupt from ethn saying that the transmitted buffer is not needed anymore and can therefore be moved to the completion_queue queue.[*]

Initial Tasks of netif_rx

netif_rx starts by saving the time the function was invoked (which also represents the time the frame was received) into the stamp field of the buffer structure:

   if (skb->stamp.tv_sec == 0)
           net_timestamp(&skb->stamp);

Saving the timestamp has a CPU cost—therefore, net_timestamp initializes skb->stamp only if there is at least one interested user for that field. Interest in the field can be advertised by calling net_enable_timestamp.

Do not confuse this assignment with the one done by the device driver right before or after it calls netif_rx:

  netif_rx(skb);
  dev->last_rx = jiffies;

Figure 10-4. CPU's ingress queues

The device driver stores in the net_device structure the time its most recent frame was received, and netif_rx stores the time the frame was received in the buffer itself. Thus, one timestamp is associated with a device and the other one is associated with a frame. Note, moreover, that the two timestamps use two different precisions. The device driver stores the timestamp of the most recent frame in jiffies, which in kernel 2.6 comes with a precision of 10 or 1 ms, depending on the architecture (for instance, before 2.6, the i386 used the value 10, but starting with 2.6 the value is 1). netif_rx, however, gets its timestamp by calling get_fast_time, which returns a far more precise value.

The ID of the local CPU is retrieved with smp_processor_id( ) and is stored in the local variable this_cpu:

  this_cpu = smp_processor_id( );

The local CPU ID is needed to retrieve the data structure associated with that CPU in a per-CPU vector, such as the following code in netif_rx:

  queue = &_ _get_cpu_var(softnet_data);

The preceding line stores in queue a pointer to the softnet_data structure associated with the local CPU that is serving the interrupt triggered by the device driver that called netif_rx.

Now netif_rx updates the total number of frames received by the CPU, including both the ones accepted and the ones discarded (because there was no space in the queue, for instance):

    netdev_rx_stat[this_cpu].total++

Each device driver also keeps statistics, storing them in the private data structure that dev->priv points to. These statistics, which include the number of received frames, the number of dropped frames, etc., are kept on a per-device basis (see Chapter 2), and the ones updated by netif_rx are on a per-CPU basis.

Managing Queues and Scheduling the Bottom Half

The input queue is managed by softnet_data->input_pkt_queue. Each input queue has a maximum length given by the global variable netdev_max_backlog, whose value is 300. This means that each CPU can have up to 300 frames in its input queue waiting to be processed, regardless of the number of devices in the system.[*]

Common sense would say that the value of netdev_max_backlog should depend on the number of devices and their speeds. However, this is hard to keep track of in an SMP system where the interrupts are distributed dynamically among the CPUs. It is not obvious which device will talk to which CPU. Thus, the value of netdev_max_backlog is chosen through trial and error. In the future, we could imagine it being set dynamically in a manner reflecting the types and number of interfaces. Its value is already configurable by the system administrator, as described in the section "Tuning via /proc and sysfs Filesystems" in Chapter 12. The performance issues are as follows: an unnecessarily large value is a waste of memory, and a slow system may simply never be able to catch up. A value that is too small, on the other hand, could reduce the performance of the device because a burst of traffic could lead to many dropped frames. The optimal value depends a lot on the system's role (host, server, router, etc.).

In previous kernels, before the softnet_data per-CPU data structure was introduced, a single input queue, called backlog, with the same maximum size of 300 frames, was shared by all devices. The main gain with softnet_data is not that n CPUs leave room on the queues for n*300 frames, but rather that there is no need for locking among CPUs, because each has its own queue.

The following code controls the conditions under which netif_rx inserts its new frame on a queue, and the conditions under which it schedules the queue to be run:

    if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
        if (queue->input_pkt_queue.qlen) {
            if (queue->throttle)
                goto drop;

enqueue:
            dev_hold(skb->dev);
            __skb_queue_tail(&queue->input_pkt_queue, skb);
#ifndef OFFLINE_SAMPLE
            get_sample_stats(this_cpu);
#endif
            local_irq_restore(flags);
            return queue->cng_level;
        }

        if (queue->throttle)
            queue->throttle = 0;

        netif_rx_schedule(&queue->backlog_dev);
        goto enqueue;
    }

    ... ... ...

drop:
    __get_cpu_var(netdev_rx_stat).dropped++;
    local_irq_restore(flags);

    kfree_skb(skb);
    return NET_RX_DROP;
}

The first if statement determines whether there is space. If the queue is full and the statement returns a false result, the CPU is put into a throttle state, which means that it is overloaded by input traffic and therefore is dropping all further frames. The code instituting the throttle is not shown here, but appears in the following section on congestion management.

If there is space on the queue, however, that is not sufficient to ensure that the frame is accepted. The CPU could already be in the "throttle" state (as determined by the third if statement), in which case, the frame is dropped.

The throttle state can be lifted when the queue is empty. This is what the second if statement tests for. When there is data on the queue and the CPU is in the throttle state, the frame is dropped. But when the queue is empty and the CPU is in the throttle state (which an if statement tests for in the second half of the code shown here), the throttle state is lifted.[*]

The dev_hold(skb->dev) call increases the reference count for the device so that the device cannot be removed until this buffer has been completely processed. The corresponding decrement, done by dev_put, takes place inside net_rx_action, which we will analyze later in this chapter.

If all tests are satisfactory, the buffer is queued into the input queue with __skb_queue_tail(&queue->input_pkt_queue, skb), the IRQ's status is restored for the CPU, and the function returns.

Queuing the frame is extremely fast because it does not involve any memory copying, just pointer manipulation. input_pkt_queue is a list of pointers. __skb_queue_tail adds the pointer to the new buffer to the list, without copying the buffer.

The NET_RX_SOFTIRQ software interrupt is scheduled for execution with netif_rx_schedule. Note that netif_rx_schedule is called only when the new buffer is added to an empty queue. The reason is that if the queue is not empty, NET_RX_SOFTIRQ has already been scheduled and there is no need to do it again.

In the section "Pending softirq Handling" in Chapter 9, we saw how the kernel takes care of scheduled software interrupts. In the upcoming section "Processing the NET_RX_SOFTIRQ: net_rx_action," we will see the internals of the NET_RX_SOFTIRQ softirq's handler.

Congestion Management

Congestion management is an important component of the input frame-processing task. An overloaded CPU can become unstable and introduce a big latency into the system. The section "Interrupts" in Chapter 9 explained why the interrupts generated by a high load can cripple the system. For this reason, congestion management mechanisms are needed to make sure the system's stability is not compromised under high network load. Common ways to reduce the CPU load under high traffic loads include:

Reducing the number of interrupts if possible

This is accomplished by coding drivers either to process several frames with a single interrupt (see the section "Processing Multiple Frames During an Interrupt" in Chapter 9), or to use NAPI.

Discarding frames as early as possible in the ingress path

If code knows that a frame is going to be dropped by higher layers, it can save CPU time by dropping the frame quickly. For instance, if a device driver knew that the ingress queue was full, it could drop a frame right away instead of relaying it to the kernel and having the latter drop it.

The second point is what we cover in this section.

A similar optimization applies to the egress path: if a device driver does not have resources to accept new frames for transmission (that is, if the device is out of memory), it would be a waste of CPU time to have the kernel pushing new frames down to the driver for transmission. This point is discussed in Chapter 11 in the section "Enabling and Disabling Transmissions."

In both cases, reception and transmission, the kernel provides a set of functions to set, clear, and retrieve the status of the receive and transmit queues, which allows device drivers (on reception) and the core kernel (on transmission) to perform the optimizations just mentioned.

A good indication of the congestion level is the number of frames that have been received and are waiting to be processed. When a device driver uses NAPI, it is up to the driver to implement any congestion control mechanism. This is because ingress frames are kept in the NIC's memory or in the receive ring managed by the driver, and the kernel cannot keep track of traffic congestion. In contrast, when a device driver does not use NAPI, frames are added to per-CPU queues (softnet_data->input_pkt_queue) and the kernel keeps track of the congestion level of the queues. In this section, we cover this latter case.

Queuing theory is a complex topic, and this book is not the place for the mathematical details. I will content myself with one simple point: the current number of frames in the queue does not necessarily represent the real congestion level. An average queue length is a better guide to the queue's status. Keeping track of the average keeps the system from wrongly classifying a burst of traffic as congestion. In the Linux network stack, average queue length is reported by two fields of the softnet_data structure, cng_level and avg_blog, that were introduced in "softnet_data Structure" in Chapter 9.

Being an average, avg_blog could be both bigger and smaller than the length of input_pkt_queue at any time. The former represents recent history and the latter represents the present situation. Because of that, they are used for two different purposes:

  • By default, every time a frame is queued into input_pkt_queue, avg_blog is updated and an associated congestion level is computed and saved into cng_level. The latter is used as the return value by netif_rx so that the device driver that called this function is given feedback about the queue status and can change its behavior accordingly.

  • The number of frames in input_pkt_queue cannot exceed a maximum size. When that size is reached, subsequent frames are dropped because the CPU is clearly overwhelmed.

Let's go back to the computation and use of the congestion level. avg_blog and cng_level are updated inside get_sample_stats, which is called by netif_rx.

At the moment, few device drivers use the feedback from netif_rx. The most common use of this feedback is to update statistics local to the device drivers. For a more interesting use of the feedback, see drivers/net/tulip/de2104x.c: when netif_rx returns NET_RX_DROP, a local variable drop is set to 1, which causes the main loop to start dropping the frames in the receive ring instead of processing them.

So long as the ingress queue input_pkt_queue is not full, it is the job of the device driver to use the feedback from netif_rx to handle congestion. When the situation gets worse and the input queue fills up, the kernel comes into play and uses the softnet_data->throttle flag to disable frame reception for the CPU. (Remember that there is a softnet_data structure for each CPU.)

Congestion Management in netif_rx

Let's go back to netif_rx and look at some of the code that was omitted from the previous section of this chapter. The following two excerpts include some of the code shown previously, along with new code that shows when a CPU is placed in the throttle state.

    if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
        if (queue->input_pkt_queue.qlen) {
            if (queue->throttle)
                goto drop;
            ... ... ...
            return queue->cng_level;
        }
        ... ... ...
    }

    if (!queue->throttle) {
        queue->throttle = 1;
        __get_cpu_var(netdev_rx_stat).throttled++;
    }

softnet_data->throttle is cleared when the queue gets empty. To be exact, it is cleared by netif_rx when the first frame is queued into an empty queue. It could also happen in process_backlog, as we will see in the section "Backlog Processing: The process_backlog Poll Virtual Function."

Average Queue Length and Congestion-Level Computation

The values of avg_blog and cng_level are always updated within get_sample_stats, which can be invoked in two different ways:

  • Every time a new frame is received (netif_rx). This is the default.

  • With a periodic timer. To use this technique, one has to define the OFFLINE_SAMPLE symbol. That's the reason why in netif_rx, the execution of get_sample_stats depends on the definition of the OFFLINE_SAMPLE symbol. It is disabled by default.

The first approach ends up running get_sample_stats more often than the second approach under medium and high traffic load.

In both cases, the formula used to compute avg_blog should be simple and quick, because it could be invoked frequently. The formula used takes into account the recent history and the present:

new_value_for_avg_blog = (old_value_of_avg_blog + current_value_of_queue_len) / 2

How much to weight the present and the past is not a simple problem. The preceding formula can adapt quickly to changes in the congestion level, since the past (the old value) is given only 50% of the weight and the present the other 50%.

get_sample_stats also updates cng_level, basing it on avg_blog through the mapping shown earlier in Figure 9-4 in Chapter 9. If the RAND_LIE symbol is defined, the function performs an extra operation in which it can randomly decide to set cng_level one level higher. This random adjustment requires more time to calculate but, oddly enough, can cause the kernel to perform better under one specific scenario.

Let's spend a few more words on the benefits of random lies. Do not confuse this behavior with Random Early Detection (RED).

In a system with only one interface, it does not really make sense to drop random frames here and there if there is no congestion; it would simply lower the throughput. But let's suppose we have multiple interfaces sharing an input queue and one device with a traffic load much higher than the others. Since the greedy device fills the shared ingress queue faster than the other devices, the latter will often find no space in the ingress queue and therefore their frames will be dropped.[*] The greedy device will also see some of its frames dropped, but not proportionally to its load. When a system with multiple interfaces experiences congestion, it should drop ingress frames across all the devices proportionally to their loads. The RAND_LIE code adds some fairness when used in this context: dropping extra frames randomly should end up dropping them proportionally to the load.

Processing the NET_RX_SOFTIRQ: net_rx_action

net_rx_action is the bottom-half function used to process incoming frames. Its execution is triggered whenever a driver notifies the kernel about the presence of input frames. Figure 10-5 shows the flow of control through the function.

Frames can wait in two places for net_rx_action to process them:

A shared CPU-specific queue

Non-NAPI devices' interrupt handlers, which call netif_rx, place frames into the softnet_data->input_pkt_queue of the CPU on which the interrupt handlers run.

Device memory

The poll method used by NAPI drivers extracts frames directly from the device (or the device driver receive rings).

The section "Old Versus New Driver Interfaces" showed how the kernel is notified about the need to run net_rx_action in both cases.

Figure 10-5. net_rx_action function

The job of net_rx_action is pretty simple: to browse the poll_list list of devices that have something in their ingress queue and invoke for each one the associated poll virtual function until one of the following conditions is met:

  • There are no more devices in the list.

  • net_rx_action has run for too long and therefore it is supposed to release the CPU so that it does not become a CPU hog.

  • The number of frames already dequeued and processed has reached a given upper bound limit (budget). budget is initialized at the beginning of the function to netdev_max_backlog, which is defined in net/core/dev.c as 300.

As we will see in the next section, net_rx_action calls the driver's poll virtual function and depends partly on this function to obey these constraints.

The size of the queue, as we saw in the section "Managing Queues and Scheduling the Bottom Half," is restricted to the value of netdev_max_backlog. This value is considered the budget for net_rx_action. However, because net_rx_action runs with interrupts enabled, new frames could be added to a device's input queue while net_rx_action is running. Thus, the number of available frames could become greater than budget, and net_rx_action has to take action to make sure it does not run too long in such cases.

Now we will see in detail what net_rx_action does inside:

static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *queue = &__get_cpu_var(softnet_data);
    unsigned long start_time = jiffies;
    int budget = netdev_max_backlog;

    local_irq_disable();

If the current device has not yet used its entire quota, it is given a chance to dequeue buffers from its queue with the poll virtual function:

    while (!list_empty(&queue->poll_list)) {
        struct net_device *dev;

        if (budget <= 0 || jiffies - start_time > 1)
            goto softnet_break;

        local_irq_enable();

        dev = list_entry(queue->poll_list.next, struct net_device, poll_list);

If dev->poll returns because the device quota was not large enough to dequeue all the buffers in the ingress queue (in which case, the return value is nonzero), the device is moved to the end of poll_list:

        if (dev->quota <= 0 || dev->poll(dev, &budget)) {
            local_irq_disable();
            list_del(&dev->poll_list);
            list_add_tail(&dev->poll_list, &queue->poll_list);
            if (dev->quota < 0)
                dev->quota += dev->weight;
            else
                dev->quota = dev->weight;
        } else {

When instead poll manages to empty the device ingress queue, net_rx_action does not remove the device from poll_list: poll is supposed to take care of it with a call to netif_rx_complete (__netif_rx_complete can also be called if IRQs are disabled on the local CPU). This will be illustrated in the process_backlog function in the next section.

Furthermore, note that budget was passed by reference to the poll virtual function; this is because that function will return a new budget that reflects the frames it processed. The main loop in net_rx_action checks budget at each pass so that the overall limit is not exceeded. In other words, budget allows net_rx_action and the poll function to cooperate to stay within their limit.

            dev_put(dev);
            local_irq_disable();
        }
    }
out:
    local_irq_enable();
    return;

This last piece of code is executed when net_rx_action is forced to return while buffers are still left in the ingress queue. In this case, the NET_RX_SOFTIRQ softirq is scheduled again for execution so that net_rx_action will be invoked later and will take care of the remaining buffers:

softnet_break:
    __get_cpu_var(netdev_rx_stat).time_squeeze++;
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);
    goto out;
}

Note that net_rx_action disables interrupts with local_irq_disable only while manipulating the poll_list list of devices to poll (i.e., when accessing its softnet_data structure instance). The netpoll_poll_lock and netpoll_poll_unlock calls, used by the NETPOLL feature, have been omitted. If you can access the kernel source code, see net_rx_action in net/core/dev.c for details.

Backlog Processing: The process_backlog Poll Virtual Function

The poll virtual function of the net_device data structure, which is executed by net_rx_action to process the backlog queue of a device, is initialized by default to process_backlog in net_dev_init for those devices not using NAPI.

As of kernel 2.6.12, only a few device drivers use NAPI and initialize dev->poll with a pointer to a function of their own: the Broadcom Tigon3 Ethernet driver in drivers/net/tg3.c was the first one to adopt NAPI and is a good example to look at. In this section, we will analyze the default handler process_backlog defined in net/core/dev.c. Its implementation is very similar to that of a poll method of a device driver using NAPI (you can, for instance, compare process_backlog to tg3_poll).

However, since process_backlog can take care of a bunch of devices sharing the same ingress queue, there is one important difference to take into account. When process_backlog runs, hardware interrupts are enabled, so the function could be preempted. For this reason, accesses to the softnet_data structure are always protected by disabling interrupts on the local CPU with local_irq_disable, especially the calls to __skb_dequeue. This lock is not needed by a device driver using NAPI:[*] when its poll method is invoked, hardware interrupts are disabled for the device. Moreover, each device has its own queue.

Let's see the main parts of process_backlog. Figure 10-6 shows its flowchart.

The function starts with a few initializations:

static int process_backlog(struct net_device *backlog_dev, int *budget)
{
    int work = 0;
    int quota = min(backlog_dev->quota, *budget);
    struct softnet_data *queue = &__get_cpu_var(softnet_data);
    unsigned long start_time = jiffies;

Then begins the main loop, which tries to dequeue all the buffers in the input queue and is interrupted only if one of the following conditions is met:

  • The queue becomes empty.

  • The device's quota has been used up.

  • The function has been running for too long.

The last two conditions are similar to the ones that constrain net_rx_action. Because process_backlog is called within a loop in net_rx_action, the latter can respect its constraints only if process_backlog cooperates. For this reason, net_rx_action passes its leftover budget to process_backlog, and the latter sets its quota to the minimum of that input parameter (budget) and its own quota.

budget is initialized by net_rx_action to 300 when it starts. The default value for dev->quota is 64 (and most devices stick with the default). Let's examine a case where several devices have full queues. The first four devices to run within this function receive a value of budget greater than their internal quota of 64, and can empty their queues. The next device may have to stop after sending a part of its queue. That is, the number of buffers dequeued by process_backlog depends both on the device configuration (dev->quota), and on the traffic load on the other devices (budget). This ensures some more fairness among the devices.

Figure 10-6. process_backlog function

    for (;;) {
        struct sk_buff *skb;
        struct net_device *dev;

        local_irq_disable();
        skb = __skb_dequeue(&queue->input_pkt_queue);
        if (!skb)
            goto job_done;
        local_irq_enable();

        dev = skb->dev;

        netif_receive_skb(skb);

        dev_put(dev);

        work++;
        if (work >= quota || jiffies - start_time > 1)
            break;

netif_receive_skb is the function that processes the frame; it is described in the next section. It is used by all poll virtual functions, both NAPI and non-NAPI.

The device's quota is updated based on the number of buffers successfully dequeued. As explained earlier, the input parameter budget is also updated because it is needed by net_rx_action to keep track of how much work it can continue to do:

    backlog_dev->quota -= work;
    *budget -= work;
    return -1;

The main loop shown earlier jumps to the label job_done if the input queue is emptied. If the function reaches this point, the throttle state can be cleared (if it was set) and the device can be removed from poll_list. The __LINK_STATE_RX_SCHED flag is also cleared since the device does not have anything in the input queue and therefore it does not need to be scheduled for backlog processing.

job_done:
    backlog_dev->quota -= work;
    *budget -= work;

    list_del(&backlog_dev->poll_list);
    smp_mb__before_clear_bit();
    netif_poll_enable(backlog_dev);

    if (queue->throttle)
        queue->throttle = 0;
    local_irq_enable();
    return 0;
}

Actually, there is another difference between process_backlog and a NAPI driver's poll method. Let's return to drivers/net/tg3.c as an example:

    if (done) {
        spin_lock_irqsave(&tp->lock, flags);
        __netif_rx_complete(netdev);
        tg3_restart_ints(tp);
        spin_unlock_irqrestore(&tp->lock, flags);
    }

done here is the counterpart of job_done in process_backlog, with the same meaning that the queue is empty. At this point, in the NAPI driver, the _ _netif_rx_complete function (defined in the same file) removes the device from the poll_list list, a task that process_backlog does directly. Finally, the NAPI driver re-enables interrupts for the device. As we anticipated at the beginning of the section, process_backlog runs with interrupts enabled.

Ingress Frame Processing

As mentioned in the previous section, netif_receive_skb is the helper function used by the poll virtual function to process ingress frames. It is illustrated in Figure 10-7.

Multiple protocols are allowed by both L2 and L3. Each device driver is associated with a specific hardware type (e.g., Ethernet), so it is easy for it to interpret the L2 header and extract the information that tells it which L3 protocol is being used, if any (see Chapter 13). When net_rx_action is invoked, the L3 protocol identifier has already been extracted from the L2 header and stored into skb->protocol by the device driver.

The three main tasks of netif_receive_skb are:

  • Passing a copy of the frame to each protocol tap, if any are running

  • Passing a copy of the frame to the L3 protocol handler associated with skb->protocol [*]

  • Taking care of those features that need to be handled at this layer, notably bridging (which is described in Part IV)

If no protocol handler is associated with skb->protocol and none of the features handled in netif_receive_skb (such as bridging) consumes the frame, it is dropped because the kernel doesn't know how to process it.

Before delivering an input frame to these protocol handlers, netif_receive_skb must handle a few features that can change the destiny of the frame.

Figure 10-7. The netif_receive_skb function

Bonding allows a group of interfaces to be grouped together and be treated as a single interface. If the interface from which the frame was received belonged to one such group, the reference to the receiving interface in the sk_buff data structure must be changed to the device in the group with the role of master before netif_receive_skb delivers the packet to the L3 handler. This is the purpose of skb_bond.

         skb_bond(skb);

The delivery of the frame to the sniffers and protocol handlers is covered in detail in Chapter 13.

Once all of the protocol sniffers have received their copy of the packet, and before the real protocol handler is given its copy, Diverter, ingress Traffic Control, and bridging features must be handled (see the next section).

When neither the bridging code nor the ingress Traffic Control code consumes the frame, the latter is passed to the L3 protocol handlers (usually there is only one handler per protocol, but multiple ones can be registered). In older kernel versions, this was the only processing needed. The more the kernel network stack was enhanced and the more features that were added (in this layer and in others), the more complex the path of a packet through the network stack became.

At this point, the reception part is complete and it will be up to the L3 protocol handlers to decide what to do with the packets:

  • Deliver them to a recipient (application) running in the receiving workstation.

  • Drop them (for instance, during a failed sanity check).

  • Forward them.

The last choice is common for routers, but not for single-interface workstations. Parts V and VI cover L3 behavior in detail.

The kernel determines from the destination L3 address whether the packet is addressed to its local system. I will postpone a discussion of this process until Part VII; let's take it for granted for the moment that somehow the packet will be delivered to the above layers (i.e., TCP, UDP, ICMP, etc.) if it is addressed to the local system, and to ip_forward otherwise (see Figure 9-2 in Chapter 9).

This finishes our long discussion of how frame reception works. The next chapter describes how frames are transmitted. This second path includes both frames generated locally and received frames that need to be forwarded.

Handling special features

netif_receive_skb checks whether any Netpoll client would like to consume the frame.

Traffic Control has always been used to implement QoS on the egress path. However, with recent releases of the kernel, you can configure filters and actions on ingress traffic, too. Based on such a configuration, ing_filter may decide that the input buffer is to be dropped or that it will be processed further somewhere else (i.e., the frame is consumed).

Diverter allows the kernel to change the L2 destination address of frames originally addressed to other hosts so that the frames can be diverted to the local host. There are many possible uses for this feature, as discussed at http://diverter.sourceforge.net. The kernel can be configured to determine the criteria used by Diverter to decide whether to divert a frame. Common criteria used for Diverter include:

  • All IP packets (regardless of L4 protocol)

  • All TCP packets

  • TCP packets with specific port numbers

  • All UDP packets

  • UDP packets with specific port numbers

The call to handle_diverter decides whether to change the destination MAC address. In addition to the change to the destination MAC address, skb->pkt_type must be changed to PACKET_HOST.

Yet another L2 feature could influence the destiny of the frame: Bridging. Bridging, the L2 counterpart of L3 routing, is addressed in Part IV. Each net_device data structure has a pointer to a data structure of type net_bridge_port that is used to store the extra information needed to represent a bridge port. Its value is NULL when the interface has not enabled bridging. When a port is configured as a bridge port, the kernel looks only at L2 headers. The only L3 information the kernel uses in this situation is information pertaining to firewalling.

Since net_rx_action represents the boundary between device drivers and the L3 protocol handlers, it is right in this function that the Bridging feature must be handled. When the kernel has support for bridging, handle_bridge is initialized to a function that checks whether the frame is to be handed to the bridging code. When the frame is handed to the bridging code and the latter consumes it, handle_bridge returns 1. In all other cases, handle_bridge returns 0 and netif_receive_skb will continue processing the frame skb.

if (handle_bridge(skb, &pt_prev, &ret))
    goto out;



[*] If DMA is used by the device, as is pretty common nowadays, the driver needs only to initialize a pointer (no copying is involved).

[*] Different device types use different functions; for instance, eth_type_trans is used by Ethernet devices and tr_type_trans by Token Ring interfaces.

[*] There is an interesting exception: when a CPU of an SMP system dies, the dev_cpu_callback routine drains the input_pkt_queue queue of the associated softnet_data instance. dev_cpu_callback is the callback routine registered by net_dev_init in the cpu_chain introduced in Chapter 9.

[*] netif_rx_ni is a sister to netif_rx and is used in noninterrupt contexts. Among the systems using it is the TUN (Universal TUN/TAP) device driver in drivers/net/tun.c.

[*] Both input_pkt_queue and completion_queue keep only the pointers to the buffers, even if the figure makes it look as if they actually store the complete buffers.

[*] This applies to non-NAPI devices. Because NAPI devices use private queues, the devices can select the maximum length they prefer. Common values are 16, 32, and 64. The 10-Gigabit Ethernet driver drivers/net/s2io.c uses a larger value (90).

[*] This case is actually rare because net_rx_action probably lifts the throttle state (indirectly via process_backlog) earlier. We will see this later in this chapter.

[*] When sharing a queue, it is up to the users to behave fairly with others, but that's not always possible. NAPI does not encounter this problem because each device using NAPI has its own queue. However, non-NAPI drivers still using the shared input queue input_pkt_queue have to live with the possibility of overloading by other devices.

[*] Because each CPU has its own instance of softnet_data, there is no need for extra locking to take care of SMP.

[*] See Chapter 13 for more details on protocol handlers.

Chapter 11. Frame Transmission

Transmission is the term used for frames that leave the system, either because they were sent by the system or because they are being forwarded. In this chapter, we will cover the main tasks involved during the frame transmission data path:

  • Enabling and disabling frame transmission for a device

  • Scheduling a device for transmission

  • Selecting the next frame to transmit among the ones waiting in the device's egress queue

  • The transmission itself (we will examine the main function)

Much about transmission is symmetric to the reception process we discussed in Chapter 10: NET_TX_SOFTIRQ is the transmission counterpart of the NET_RX_SOFTIRQ softirq, net_tx_action is the counterpart of net_rx_action, and so on. Thus, if you have studied the earlier chapter, you should find it easy to follow this one. Figure 11-1 compares the logic behind scheduling a device for reception and scheduling a device for transmission. Here are some more similarities:

  • poll_list is the list of devices that are polled because they have a nonempty receive queue. output_queue is the list of devices that have something to transmit. poll_list and output_queue are two fields of the softnet_data structure introduced in Chapter 9.

  • Only open devices (ones with the __LINK_STATE_START flag set) can be scheduled for reception. Only devices with transmission enabled (ones with the __LINK_STATE_XOFF flag cleared) can be scheduled for transmission.

  • When a device is scheduled for reception, its __LINK_STATE_RX_SCHED flag is set. When a device is scheduled for transmission, its __LINK_STATE_SCHED flag is set.

dev_queue_xmit plays the same role for the egress path that netif_rx plays for the ingress path: each transfers one frame between the driver's buffer and the kernel's queue. The net_tx_action function is called both when there are devices waiting to transmit something and to do housekeeping with the buffers that are not needed anymore. Just as there are queues for ingress traffic, there are queues for egress traffic. The egress queues , handled by Traffic Control (the QoS layer), are actually much more complex than the ingress ones: while the latter are just ordinary First In, First Outs (FIFOs), the former can be hierarchical, represented by trees of queues. Even though Traffic Control has support for ingress queueing too, it's used more for policing and management reasons rather than real queuing: Traffic Control does not use real queues for ingress traffic, but only classifies and applies actions.

Figure 11-1. Scheduling a device: (a) for reception (RX); (b) for transmission (TX)

Enabling and Disabling Transmissions

In the section "Congestion Management" in Chapter 10, we learned about some conditions under which frame reception must be disabled, either on a single device or globally. Something similar applies to frame transmission as well.

The status of the egress queue is represented by the flag __LINK_STATE_XOFF in net_device->state. Its value can be manipulated and checked with the following functions, defined in include/linux/netdevice.h:[*]

netif_start_queue

Enables transmission for the device. It is usually called when the device is activated and can be called again later if needed to restart a stopped device.

netif_stop_queue

Disables transmission for the device. Any attempt to transmit something on the device will be denied. Later in this section is an example of a common case where this function is used.

netif_queue_stopped

Returns the status of the egress queue: enabled or disabled. This function is simply:

static inline int netif_queue_stopped(const struct net_device *dev)
{
    return test_bit(__LINK_STATE_XOFF, &dev->state);
}

Only device drivers enable and disable transmission of devices.

Why stop and start a queue once the device is running? One reason is that a device can temporarily use up its memory, thus causing a transmission attempt to fail. In the past, the transmitting function (which I introduce later in the section "dev_queue_xmit Function") would have to deal with this problem by putting the frame back into the queue (requeuing it). Now, thanks to the __LINK_STATE_XOFF flag, this extra processing can be avoided. When the device driver realizes that it does not have enough space to store a frame of maximum size (MTU), it stops the egress queue with netif_stop_queue. In this way, it is possible to avoid wasting resources on future transmissions that the kernel already knows will fail. The following example of this throttling at work is taken from vortex_start_xmit (the hard_start_xmit method used by the drivers/net/3c59x.c driver):

    outsl(ioaddr + TX_FIFO, skb->data, (skb->len + 3) >> 2);
    dev_kfree_skb(skb);
    if (inw(ioaddr + TxFree) > 1536) {
        netif_start_queue(dev);    /* AKPM: redundant? */
    } else {
        /* Interrupt us when the FIFO has room for max-sized packet. */
        netif_stop_queue(dev);
        outw(SetTxThreshold + (1536>>2), ioaddr + EL3_CMD);
    }

Shortly after the transmission by outsl, the code checks whether there is space for a frame of maximum size (1536), and uses netif_stop_queue to stop the device's egress queue if there is not. This is a relatively crude technique used to avoid transmission failures due to a shortage of memory. Of course, the transmission of a frame of 300 bytes would succeed when just a little more than 300 bytes are left; therefore, checking for 1,536 bytes could disable transmission unnecessarily. The code could compromise by using a lower value, such as 500, but in the end, the gain would not be that big and there could be failures when bigger frames arrive while transmission is enabled.

To cover all eventualities, the code calls netif_start_queue when there is enough memory on the device. The redundant? comment in the code refers to the practice of restarting the queue on two types of interrupts. The driver requests a restart to the queue when the device indicates that it has finished transmitting, and when it indicates that there is enough space in its memory for another frame. Probably, the queue would be restarted promptly if the driver did so on only one of these interrupts, but that's not guaranteed. So the request to restart the queue is issued under both circumstances.

The code also sends a SetTxThreshold command to the device, which instructs the device to generate an interrupt when a given amount of memory (the size of the MTU, in this case) becomes available.

You may wonder when and how the queue will be re-enabled in the previous scenario. In the case of the Vortex driver, it asks the device to generate an interrupt when a given amount of memory (the size of the MTU, in this case) becomes available. This is the piece of code that handles such an interrupt:

static void vortex_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        ... ... ...
        if (status & TxAvailable) {
            if (vortex_debug > 5)
                printk(KERN_DEBUG "    TX room bit was handled.\n");
            /* There's room in the FIFO for a full-sized packet. */
            outw(AckIntr | TxAvailable, ioaddr + EL3_CMD);
            netif_wake_queue(dev);
        }
        ... ... ...
}

The bits of the status variable represent the reasons why the interrupt was generated by the card. The TxAvailable bit indicates that space is available and that it's therefore safe to wake up the device (this is called waking the queue, and is carried out by netif_wake_queue). Values such as EL3_CMD are simply offsets from ioaddr used by the driver to read or write the network card registers at the right positions.

Note that the egress queue is re-enabled with netif_wake_queue instead of netif_start_queue. That new function, which we will see later in more detail, not only enables the egress queue but also asks the kernel to check whether anything in that queue is waiting to be transmitted. The reason is that during the time the queue was disabled, there could have been transmission attempts. In this case, they would have failed, and those frames that could not be sent would have been put back into the egress queue.

Scheduling a Device for Transmission

When describing the ingress path, we saw that when a device receives a frame, its driver invokes a kernel function (the one invoked depends on whether the driver uses NAPI) that adds the device to a polling list and schedules the NET_RX_SOFTIRQ for execution.

Something very similar happens on the egress path. To transmit frames, the kernel provides the dev_queue_xmit function, described later in its own section. This function dequeues a frame from the device's egress queue and feeds it to the device's hard_start_xmit method. However, dev_queue_xmit might not be able to transmit for various reasons—for instance, because the device's egress queue is disabled, as we saw in the previous section, or because the lock on the device queue is already taken. To handle the latter case, the kernel provides a function called __netif_schedule that schedules a device for transmission (somewhat similar to what netif_rx_schedule does on the reception path). This function is never called directly, but through two wrappers shown later in this section.

Here is the function's definition from include/linux/netdevice.h:

static inline void __netif_schedule(struct net_device *dev)
{
    if (!test_and_set_bit(__LINK_STATE_SCHED, &dev->state)) {
        unsigned long flags;
        struct softnet_data *sd;

        local_irq_save(flags);
        sd = &__get_cpu_var(softnet_data);
        dev->next_sched = sd->output_queue;
        sd->output_queue = dev;
        raise_softirq_irqoff(NET_TX_SOFTIRQ);
        local_irq_restore(flags);
    }
}

__netif_schedule accomplishes two main tasks:

  • It adds the device to the head of the output_queue list. This list is the counterpart to the poll_list list used by reception. There is one output_queue for each CPU, just as there is one poll_list for each CPU. However, output_queue is used by both NAPI and non-NAPI devices, and poll_list is used only to handle non-NAPI devices. The devices in the output_queue list are linked together with the net_device->next_sched pointer. You will see in the section "Processing the NET_TX_SOFTIRQ: net_tx_action" how that list is used.

    We already saw in the section "softnet_data Structure" in Chapter 9 that output_queue represents a list of devices that have something to send (because they failed on previous attempts, as described in the section "Queuing Discipline Interface") or whose egress queues have been re-enabled after having been disabled for a while. Because __netif_schedule may be called both inside and outside interrupt context, it disables interrupts while adding the input device to the output_queue list.

  • It schedules the NET_TX_SOFTIRQ softirq for execution. __LINK_STATE_SCHED is used to mark devices that are in the output_queue list because they have something to send. (__LINK_STATE_SCHED is the counterpart of the reception path's __LINK_STATE_RX_SCHED.) Note that if the device was already scheduled for transmission, __netif_schedule would not do anything.

Since it does not make sense to schedule a device for transmission if transmission is disabled on the device, the kernel provides two functions to be used instead, both wrappers around __netif_schedule:

netif_schedule [*]

Simply makes sure transmission is enabled on the device before scheduling it for transmission:

static inline void netif_schedule(struct net_device *dev)
{
    if (!test_bit(__LINK_STATE_XOFF, &dev->state))
        __netif_schedule(dev);
}
netif_wake_queue

Enables transmission for the device and, if transmission was previously disabled, schedules the device for transmission. This scheduling is needed because there could have been transmission attempts while the device queue was disabled. We saw an example of its use in the previous section.

static inline void netif_wake_queue(struct net_device *dev)
{
    ...
    if (test_and_clear_bit(__LINK_STATE_XOFF, &dev->state))
        __netif_schedule(dev);
}

test_and_clear_bit clears the __LINK_STATE_XOFF flag if it is set, and returns the old value.

Note that a call to netif_wake_queue is equivalent to a call to both netif_start_queue and netif_schedule. I said in the section "Enabling and Disabling Transmissions" that it is the responsibility of the driver, not higher-layer functions, to disable and enable transmission on devices. Usually, high-level functions schedule transmissions on devices, and device drivers disable and re-enable the queue when required, such as to handle a shortage of memory. Therefore, it should not come as a surprise that netif_wake_queue is the one used by device drivers, and netif_schedule is the one used elsewhere (for example, by net_tx_action [*] and Traffic Control).

A device driver uses netif_wake_queue in the following cases:

  • We will see in the section "Watchdog timer" that device drivers use a watchdog timer to recover from a transmission that hangs. In such a situation, the virtual function net_device->tx_timeout usually resets the card. During that black hole in which the device is not usable, there could be other transmission attempts, so the driver needs to first enable the device's queue and then schedule the device for transmission. The same applies to interrupts that signal error conditions (look at drivers/net/3c59x.c for some examples).

  • When (as previously requested by the driver itself) the device signals to the driver that it has enough memory to handle the transmission of a frame of a given size, the device can be awakened. We already saw an example of this practice in the previous section in relation to the TxAvailable interrupt. The reason for using this function, again, is that during the time the driver has disabled the queue, there could have been transmission attempts. A similar consideration applies to the interrupt type that tells the driver when a driver-to-card DMA transfer has completed.

Queuing Discipline Interface

Almost all devices use a queue to schedule egress traffic, and the kernel can use algorithms known as queuing disciplines to arrange the frames in the most efficient order for transmission. Although a detailed discussion of Traffic Control and its queuing disciplines is outside the scope of this book, in this section I'll provide a brief overview of the interface between device drivers and the transmission layer discussed in this chapter.

Each Traffic Control queuing discipline can provide different function pointers to be called by higher layers to accomplish different tasks. Among the most important functions are:

enqueue

Adds an element to the queue

dequeue

Extracts an element from the queue

requeue

Puts back on the queue an element that was previously extracted (e.g., because of a transmission failure)
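The contract of these three operations can be modeled in a few lines of user-space C. This is purely an illustrative sketch: `struct mini_qdisc` and its fixed-size ring are inventions of this example, not kernel structures, but the enqueue/dequeue/requeue behavior they implement (including putting a frame back at the head after a failed transmission) is the one described above.

```c
#include <assert.h>

/* Illustrative model only: a miniature "queuing discipline" whose
 * operations mirror the enqueue/dequeue/requeue function pointers
 * described above. All names here are invented for this sketch. */
#define QLEN 8

struct mini_qdisc {
    int buf[QLEN];
    int head, count;
};

static int mini_enqueue(struct mini_qdisc *q, int frame)
{
    if (q->count == QLEN)
        return -1;                              /* queue full: drop */
    q->buf[(q->head + q->count) % QLEN] = frame;
    q->count++;
    return 0;
}

static int mini_dequeue(struct mini_qdisc *q, int *frame)
{
    if (q->count == 0)
        return -1;                              /* nothing to send */
    *frame = q->buf[q->head];
    q->head = (q->head + 1) % QLEN;
    q->count--;
    return 0;
}

/* Put a frame back at the head, e.g. after a failed transmission. */
static int mini_requeue(struct mini_qdisc *q, int frame)
{
    if (q->count == QLEN)
        return -1;
    q->head = (q->head + QLEN - 1) % QLEN;
    q->buf[q->head] = frame;
    q->count++;
    return 0;
}
```

Note that requeue reinserts at the head, not the tail: the frame that failed should be the next one tried, preserving transmission order.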

Whenever a device is scheduled for transmission, the next frame to transmit is selected by the qdisc_run function, which indirectly calls the dequeue virtual function of the associated queuing discipline.

Once again, the real job is actually done by another function, qdisc_restart. The qdisc_run function, defined in include/linux/pkt_sched.h, is simply a wrapper that filters out requests for devices whose egress queues are disabled:

static inline void qdisc_run(struct net_device *dev)
{
    while (!netif_queue_stopped(dev) && qdisc_restart(dev) < 0)
        /* NOTHING */;
}

qdisc_restart function

We saw earlier the common cases where a device is scheduled for transmission. Sometimes it is because something in the egress queue is waiting to be transmitted. But at other times, the device is scheduled because the queue has been disabled for a while and therefore there could be something waiting in the queue from previous failed transmission attempts. The driver does not know whether anything has actually arrived; it must schedule the device in case data is waiting. If in fact no data is waiting, the subsequent call to the dequeue method fails. Even if data is waiting, the call can fail because complex queuing disciplines may decide not to transmit any of the data. Therefore, qdisc_restart, defined in net/sched/sch_generic.c, takes various actions based on the return value of the dequeue method.

int qdisc_restart(struct net_device *dev)
{
    struct Qdisc *q = dev->qdisc;
    struct sk_buff *skb;
 
    if ((skb = q->dequeue(q)) != NULL) {

The dequeue function is called at the very start. Let's suppose it succeeded. Transmitting a frame requires the acquisition of two locks:

  • The lock that protects the queue (dev->queue_lock). This is acquired by the caller of qdisc_restart (dev_queue_xmit).

  • The lock on the driver's transmit routine hard_start_xmit (dev->xmit_lock). The lock is managed by this function. When the device driver already implements its own locking, it indicates this by setting the NETIF_F_LLTX flag (lockless transmission feature) in dev->features to tell the upper layers that there is no need to acquire the dev->xmit_lock lock as well. The use of NETIF_F_LLTX allows the kernel to optimize the transmit data path by not acquiring dev->xmit_lock when it is not needed. Of course, there is no need to acquire the lock if the queue is empty.
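The trylock-or-requeue logic that this locking scheme leads to can be sketched in user space with a C11 `atomic_flag` standing in for `dev->xmit_lock`. All the names here (`xmit_trylock`, `try_transmit`, and so on) are invented for the sketch; in the kernel, the failure path additionally requeues the frame and calls `netif_schedule`, as shown in the code later in this section.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative model only: the spin_trylock behaviour of
 * qdisc_restart, with a C11 atomic_flag in place of the kernel
 * spinlock dev->xmit_lock. All names are invented for this sketch. */
static atomic_flag xmit_in_use = ATOMIC_FLAG_INIT;

static bool xmit_trylock(void)
{
    /* true when the lock was free and is now ours */
    return !atomic_flag_test_and_set(&xmit_in_use);
}

static void xmit_unlock(void)
{
    atomic_flag_clear(&xmit_in_use);
}

/* Returns 1 when the frame was handed to the "driver", 0 when the
 * lock was busy and the frame would have to be requeued instead. */
static int try_transmit(void)
{
    if (!xmit_trylock())
        return 0;   /* collision: requeue + netif_schedule in the kernel */
    /* ... dev->hard_start_xmit(skb, dev) would run here ... */
    xmit_unlock();
    return 1;
}
```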

Note that qdisc_restart does not release the queue_lock immediately after dequeuing a buffer, because the function might have to requeue the buffer right away if it fails to acquire the lock on the driver. The function releases queue_lock when it has the driver lock in hand, and reacquires queue_lock before returning. Ultimately, dev_queue_xmit will take care of releasing it.

When the driver does not support NETIF_F_LLTX and the driver lock is already taken (i.e., spin_trylock returns 0), transmission fails. If qdisc_restart fails to grab the lock on the driver, it means that another CPU is transmitting through the same device. All that qdisc_restart can do in this case is put the frame back into the queue and reschedule the device for transmission, since it does not want to wait. If the function is running on the same CPU that is holding the lock, a loop (i.e., a bug in the code) has been detected and the frame is dropped; otherwise, it is just a collision.

            if (!spin_trylock(&dev->xmit_lock)) {
            collision:
                ...
                goto requeue;
            }
            ...
requeue:
        q->ops->requeue(skb, q);
        netif_schedule(dev);

Once the driver lock is successfully acquired, the lock on the queue is released so that other CPUs can access the queue. Sometimes, there is no need to acquire the driver lock because NETIF_F_LLTX is set. In either case, qdisc_restart is ready to start its real job.

            if (!netif_queue_stopped(dev)) {
                int ret;
                if (netdev_nit)
                    dev_queue_xmit_nit(skb, dev);
 
                ret = dev->hard_start_xmit(skb, dev);
                if (ret == NETDEV_TX_OK) {
                    if (!nolock) {
                        dev->xmit_lock_owner = -1;
                        spin_unlock(&dev->xmit_lock);
                    }
                    spin_lock(&dev->queue_lock);
                    return -1;
                }
                if (ret == NETDEV_TX_LOCKED && nolock) {
                    spin_lock(&dev->queue_lock);
                    goto collision;
                }
            }

We saw in the previous section that qdisc_run has already checked the status of the egress queue with netif_queue_stopped, but here qdisc_restart checks it again. The second check is not superfluous. Consider this scenario: when qdisc_run called netif_queue_stopped, the lock on the driver was not taken yet. By the time the lock is taken, another CPU could have sent something and the card could have run out of buffer space. Therefore, netif_queue_stopped may have returned FALSE before but would now return TRUE.

netdev_nit represents the number of protocol sniffers registered. If any are registered, dev_queue_xmit_nit is used to deliver a copy of the frame to each. (We saw something similar for reception in netif_receive_skb in Chapter 10.)

Finally we get to the invocation of the device driver's virtual function for frame transmission. The function provided by the device driver is dev->hard_start_xmit, which is defined for each device at initialization time (see Chapter 8). The NETDEV_TX_XXX values returned by hard_start_xmit routines are listed in include/linux/netdevice.h. Here is how qdisc_restart handles them:

NETDEV_TX_OK [*]

The transmission succeeded. The buffer is not released yet (kfree_skb is not issued). We will see in the section "Processing the NET_TX_SOFTIRQ: net_tx_action" that the driver does not release the buffer itself but asks the kernel to do so by means of the NET_TX_SOFTIRQ softirq. This provides more efficient memory handling than if each driver did its own freeing.

NETDEV_TX_BUSY

The driver has discovered that the NIC lacks sufficient room in its transmit buffer pool. When this condition is detected, the driver often calls netif_stop_queue too (see the section "Enabling and Disabling Transmissions").

NETDEV_TX_LOCKED

The driver is locked. This return value is used only by drivers that support NETIF_F_LLTX.

In summary, transmission fails and a frame must be put back onto the queue when one of the following conditions is true:

  • The queue is disabled (netif_queue_stopped(dev) is true).

  • Another CPU is holding the lock on the driver.

  • The driver failed (hard_start_xmit did not return NETDEV_TX_OK).
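The decision table implied by these return codes can be summarized with a small helper. The numeric NETDEV_TX_XXX values below match the 2.6-era include/linux/netdevice.h; handle_tx_status itself is an invented illustration of how qdisc_restart reacts to each code, not a kernel function.

```c
#include <assert.h>

/* Values as defined in 2.6-era include/linux/netdevice.h. */
#define NETDEV_TX_OK      0
#define NETDEV_TX_BUSY    1
#define NETDEV_TX_LOCKED -1

enum tx_outcome { TX_DONE, TX_REQUEUE, TX_COLLISION };

/* Invented helper: classifies a hard_start_xmit return value the way
 * qdisc_restart does. "lockless" models the NETIF_F_LLTX case. */
static enum tx_outcome handle_tx_status(int ret, int lockless)
{
    if (ret == NETDEV_TX_OK)
        return TX_DONE;             /* buffer freed later via NET_TX_SOFTIRQ */
    if (ret == NETDEV_TX_LOCKED && lockless)
        return TX_COLLISION;        /* NETIF_F_LLTX driver was busy */
    return TX_REQUEUE;              /* e.g. NETDEV_TX_BUSY: queue it again */
}
```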

See Figure 11-2 for details of the qdisc_restart function.

dev_queue_xmit Function

This function is the interface to the device driver that performs a transmission. As shown in Figure 9-2 in Chapter 9, dev_queue_xmit can lead to the execution of the driver transmit function hard_start_xmit through two alternate paths:

Interfacing to Traffic Control (the QoS layer)

This is done through the qdisc_run function that we already described in the previous section.

Invoking hard_start_xmit directly

This is done only for devices that do not use the Traffic Control infrastructures (i.e., virtual devices).

We will look at these cases soon, but let's start with the checks and tasks common to both.

dev_queue_xmit调用时,传输帧所需的所有信息,例如输出设备、下一跳及其链路层地址,都已准备就绪。第六部分和第七部分描述了如何初始化这些参数。

When dev_queue_xmit is called, all the information required to transmit the frame, such as the outgoing device, the next hop, and its link layer address, is ready. Parts VI and VII describe how those parameters are initialized.

Figures 11-3(a) and 11-3(b) describe dev_queue_xmit.

dev_queue_xmit receives only an sk_buff structure as input. This contains all the information the function needs. skb->dev, for instance, is the outgoing device, and skb->data points to the beginning of the payload, whose length is skb->len.

int dev_queue_xmit(struct sk_buff *skb)

The main tasks of dev_queue_xmit are:

  • Checking whether the frame is composed of fragments and whether the device can handle them through scatter/gather DMA; combining the fragments if the device is incapable of doing so. See Chapter 21 for a discussion of fragmented buffers.

  • Making sure the L4 checksum (that is, TCP/UDP) is computed, unless the device computes the checksum in hardware. See Chapter 18 for more details on checksumming.

  • Selecting which frame to transmit (the one pointed to by the input sk_buff may not be the one to transmit because there is a queue to honor).

In the following code, the data payload is a list of fragments when skb_shinfo(skb)->frag_list is non-NULL; otherwise, the payload is a single block. If there are fragments, the code checks whether scatter/gather DMA is a feature supported by the device, and if not, combines the fragments into a single buffer itself. The function must also combine the fragments if any of them are stored in a memory area whose address is too big to be addressed by the device (that is, if illegal_highdma(dev, skb) is true).[*]

    if (skb_shinfo(skb)->frag_list &&
        !(dev->features & NETIF_F_FRAGLIST) &&
        __skb_linearize(skb, GFP_ATOMIC)) {
        goto out_kfree_skb;
    }

    if (skb_shinfo(skb)->nr_frags &&
        (!(dev->features & NETIF_F_SG) || illegal_highdma(dev, skb)) &&
        __skb_linearize(skb, GFP_ATOMIC)) {
        goto out_kfree_skb;
    }

The defragmentation of fragments is done by __skb_linearize, which can fail for one of the following reasons:

  • The allocation of the new buffer required to store the joined fragments failed.

  • The sk_buff buffer is shared with some other subsystems (that is, the reference count is bigger than one). In this case, the function does not actually fail, but generates a warning with a call to BUG().

The L4 checksum can be calculated both in software and in hardware.[*] Not all network cards can compute the checksum in hardware; the ones that can will set the associated bit flag in net_device->features during device initialization. This tells higher network layers that they do not need to worry about checksumming. The checksum must instead be calculated in software if:

  • There is no support for hardware checksumming.

  • The interface can use hardware checksumming only for TCP/UDP packets over IP, but the packet being transmitted does not use IP or uses another L4 protocol over IP.

The software checksum is calculated with skb_checksum_help:

    if (skb->ip_summed == CHECKSUM_HW &&
        (!(dev->features & (NETIF_F_HW_CSUM | NETIF_F_NO_CSUM)) &&
         (!(dev->features & NETIF_F_IP_CSUM) ||
          skb->protocol != htons(ETH_P_IP))))
        if (skb_checksum_help(skb, 0))
            goto out_kfree_skb;

Figure 11-2. qdisc_restart function


Figure 11-3a. dev_queue_xmit function

Once the checksum has been handled, all the headers are ready; the next step is to decide which frame to transmit.

At this point, the behavior depends on whether the device uses the Traffic Control infrastructure and therefore has a queuing discipline assigned. Yes, this may come as a surprise. The function has just processed one buffer (defragmenting and checksumming it if needed) but depending on whether a queuing discipline is used and which one is used, and on the status of the outgoing queue, this buffer may not be the one that will actually be sent next.

Queueful devices

When it exists, the queuing discipline of the device is accessible through dev->qdisc. The input frame is queued with the enqueue virtual function, and one frame is then dequeued and transmitted via qdisc_run, described in detail in the section "Queuing Discipline Interface."

    local_bh_disable();

Figure 11-3b. dev_queue_xmit function

    q = rcu_dereference(dev->qdisc);
    ...
    if (q->enqueue) {
        spin_lock(&dev->queue_lock);
 
        rc = q->enqueue(skb, q);
 
        qdisc_run(dev);
 
        spin_unlock_bh(&dev->queue_lock);
        rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
        goto out;
    }

Note that both enqueuing and dequeuing are protected by the queue_lock lock on the queue. Softirqs are also disabled with local_bh_disable, which also takes care of disabling preemption as required by read-copy-update (RCU).

Queueless devices

Some devices, such as the loopback device, do not have a queue: whenever a frame is transmitted, it is immediately delivered. (But because there is no place to requeue them, frames are dropped if something goes wrong; they are not given a second chance.) If you look at loopback_xmit in drivers/net/loopback.c, you will see at the end a direct call to netif_rx, bypassing all the queuing business. We saw in Chapter 10 that netif_rx is the API called by non-NAPI device drivers to put an incoming frame into the input queue and signal higher layers about the event. Since there is no input queue for the loopback device, the transmission function accomplishes two tasks: transmit on one side and receive on the other, as shown in Figure 11-4.
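This queueless, loopback-style path can be modeled as follows. It is an illustrative sketch only (fake_netif_rx and queueless_xmit are invented names): transmission hands the frame straight to the receive side, and a failure simply drops the frame because there is no queue to put it back on.

```c
#include <assert.h>

/* Illustrative model only: a queueless transmit path in the spirit of
 * loopback_xmit. All names here are invented for this sketch. */
static int rx_count;   /* frames "received" on the other side */
static int dropped;    /* frames lost: no queue, so no second chance */

static int fake_netif_rx(int frame)
{
    (void)frame;
    rx_count++;
    return 0;
}

static void queueless_xmit(int frame, int link_ok)
{
    if (!link_ok) {
        dropped++;             /* no requeue possible: frame is lost */
        return;
    }
    fake_netif_rx(frame);      /* transmit on one side = receive on the other */
}
```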


Figure 11-4. (a) Queueful device transmission; (b) loopback transmission

The last part of dev_queue_xmit is used to handle devices without a queuing discipline and therefore without an egress queue. It closely resembles the behavior of qdisc_run covered in the section "Queuing Discipline Interface." There are, however, two differences in the case where no queue is used:

  • When a transmission fails, the driver cannot put the buffer back into any queue because there is no queue, so the buffer is dropped by dev_queue_xmit. If the higher layers are using a reliable protocol such as TCP, the data will eventually be retransmitted; otherwise, it will be lost.

  • The NETIF_F_LLTX feature introduced in the section "qdisc_restart function" is taken care of by the two macros HARD_TX_LOCK and HARD_TX_UNLOCK. HARD_TX_LOCK uses spin_lock rather than spin_trylock: when the driver lock is already taken, dev_queue_xmit spins, waiting for it to be released.

Processing the NET_TX_SOFTIRQ: net_tx_action

We saw in Chapter 10 that the net_rx_action function is the handler associated with NET_RX_SOFTIRQ software interrupts. It is triggered by device drivers (and by itself under some specific conditions) and handles the part of the input frame processing that is postponed by device drivers to the "after interrupt handling phase." In this way, the code executed in interrupt context by the driver does only what is strictly necessary (copy the data in memory and signal the kernel about its existence by generating a software interrupt) and does not force the rest of the system to wait long; later on, the software interrupt takes care of that part of the frame processing that can wait.

net_tx_action works in a similar way. It can be triggered with raise_softirq_irqoff(NET_TX_SOFTIRQ) by devices in two different contexts, to accomplish two main tasks:

  • By netif_wake_queue when transmission is enabled on a device. In this case, it makes sure that frames waiting to be sent are actually sent when all the needed conditions are met (for instance, when the device has enough memory).

  • By dev_kfree_skb_irq when a transmission has completed and the device driver signals with the former routine that the associated buffer can be released. In this case, it deallocates the sk_buff structures associated with successfully transmitted buffers.

The reason for the second task is as follows. We know that when code from the device driver runs in interrupt context, it needs to be as quick as possible. Releasing a buffer can take time, so it is deferred by asking the net_tx_action softirq to take care of it. Instead of using dev_kfree_skb, device drivers use dev_kfree_skb_irq. While the former deallocates the sk_buff (which actually consists of the buffer going back into a per-CPU cache), the latter simply adds the pointer to the buffer being released to the completion_queue list of the softnet_data structure associated with the CPU and lets net_tx_action do the real job later.

Let's see how net_tx_action accomplishes its two tasks.

It starts by deallocating all the buffers that have been added to the completion_queue list by the device drivers' calls to dev_kfree_skb_irq. Because net_tx_action is running outside interrupt context, a device driver could add elements to the list at any time, so net_tx_action must disable interrupts while accessing the softnet_data structure. To keep interrupts disabled as little as possible, it clears the list by setting completion_queue to NULL and saves the pointer to the list in a local variable clist, which no one else can access (note also that each CPU has its own list). This way, it can walk through the list and free each element with __kfree_skb, while drivers can continue adding new elements to completion_queue.

    if (sd->completion_queue) {
        struct sk_buff *clist;

        local_irq_disable();
        clist = sd->completion_queue;
        sd->completion_queue = NULL;
        local_irq_enable();

        while (clist != NULL) {
            struct sk_buff *skb = clist;
            clist = clist->next;

            BUG_TRAP(!atomic_read(&skb->users));
            __kfree_skb(skb);
        }
    }

The second half of the function, which transmits frames, works similarly: it uses a local variable to remain safe from hardware interrupts. Note that for each device, before transmitting anything, the function needs to grab the lock on the output device's queue (dev->queue_lock). If the function fails to grab the lock (because another CPU holds it), it simply reschedules the device for transmission with netif_schedule.

    if (sd->output_queue) {
        struct net_device *head;

        local_irq_disable();
        head = sd->output_queue;
        sd->output_queue = NULL;
        local_irq_enable();

        while (head) {
            struct net_device *dev = head;
            head = head->next_sched;

            smp_mb__before_clear_bit();
            clear_bit(__LINK_STATE_SCHED, &dev->state);

            if (spin_trylock(&dev->queue_lock)) {
                qdisc_run(dev);
                spin_unlock(&dev->queue_lock);
            } else {
                netif_schedule(dev);
            }
        }
    }

We already saw in the section "Queuing Discipline Interface" how qdisc_run works. Devices are handled in a sequential order starting from the head of the list. Because the netif_schedule function (calling __netif_schedule internally) adds elements at the head of the list, devices are served in Last In, First Out (LIFO) order, which in some conditions may be unfair.
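The LIFO effect is a direct consequence of head insertion, as this user-space sketch shows (struct fake_dev and both helpers are inventions of the example, not kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model only: __netif_schedule-style insertion at the
 * head of output_queue, showing why devices are served LIFO. */
struct fake_dev {
    int id;
    struct fake_dev *next_sched;
};

static struct fake_dev *output_queue;

static void fake_netif_schedule(struct fake_dev *dev)
{
    dev->next_sched = output_queue;   /* insert at the head... */
    output_queue = dev;
}

/* ...so the walk in net_tx_action sees the most recent device first. */
static int first_scheduled_id(void)
{
    return output_queue ? output_queue->id : -1;
}
```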

That completes the net_tx_action function; let's look at some contexts where it can be invoked to free buffers. Some functions that desire to release a buffer can be invoked in different contexts, inside or outside interrupt context. A wrapper is available to handle these cases elegantly:

static inline void dev_kfree_skb_any(struct sk_buff *skb)
{
    if (in_irq() || irqs_disabled())
        dev_kfree_skb_irq(skb);
    else
        dev_kfree_skb(skb);
}

The dev_kfree_skb_irq function runs when the calling function is in interrupt context, and looks like this:

static inline void dev_kfree_skb_irq(struct sk_buff *skb)
{
    if (atomic_dec_and_test(&skb->users)) {
        struct softnet_data *sd;
        unsigned long flags;

        local_irq_save(flags);
        sd = &__get_cpu_var(softnet_data);
        skb->next = sd->completion_queue;
        sd->completion_queue = skb;
        raise_softirq_irqoff(NET_TX_SOFTIRQ);
        local_irq_restore(flags);
    }
}

A buffer can be freed only if there are no other references to it (that is, if skb->users is 0).

Let's see an example of how the execution of net_tx_action is triggered by an indirect call to cpu_raise_softirq(cpu, NET_TX_SOFTIRQ) by a device driver. (Another example can be found in the section "Enabling and Disabling Transmissions.")

Among the interrupt types handled by the vortex_interrupt function in drivers/net/3c59x.c we introduced earlier is an interrupt invoked by the device to tell the driver that a DMA transfer from the CPU to the device is completed (DMADone). Since the buffer has been transferred to the device, the sk_buff structure can now be freed. Because the interrupt handler is running in interrupt context, the driver calls dev_kfree_skb_irq.

if (status & DMADone) {
    if (inw(ioaddr + Wn7_MasterStatus) & 0x1000) {
        outw(0x1000, ioaddr + Wn7_MasterStatus); /* Ack the event. */
        pci_unmap_single(VORTEX_PCI(vp), vp->tx_skb_dma,
                 (vp->tx_skb->len + 3) & ~3, PCI_DMA_TODEVICE);
        dev_kfree_skb_irq(vp->tx_skb); /* Release the transferred buffer */
        if (inw(ioaddr + TxFree) > 1536) {
            netif_wake_queue(dev);
        } else { /* Interrupt when FIFO has room for max-sized packet. */
            outw(SetTxThreshold + (1536>>2), ioaddr + EL3_CMD);
            netif_stop_queue(dev);
        }
    }
}

Watchdog timer

We saw in the section "Enabling and Disabling Transmissions" that transmission can be disabled by a device driver when certain conditions are met. The disabling of transmission is supposed to be temporary, so when transmission is not re-enabled within a reasonable amount of time, the kernel assumes the device is experiencing some problems and should be restarted.

This is achieved by a per-device timer that is started with dev_watchdog_up when the device is activated with dev_activate. The timer regularly expires, makes sure everything is OK with the device, and restarts itself. When it detects a problem—because the device's egress queue is disabled (netif_queue_stopped returns TRUE) and too much time has passed since the last frame transmission took place—the timer's handler invokes a routine registered by the device driver, which resets the NIC.

Here are the net_device fields used to implement this mechanism:

trans_start

This is the timestamp initialized by the device driver when the last frame transmission started.

watchdog_timer

This is the timer started by Traffic Control. The handler executed when the timer expires is dev_watchdog, defined in net/sched/sch_generic.c.

watchdog_timeo

This is the amount of time to wait. This is initialized by the device driver. When it is set to 0, watchdog_timer is not started.

tx_timeout

This is the routine provided by the device driver that will be invoked by dev_watchdog to reset the device.

When the timer expires, the kernel handler dev_watchdog takes action by calling the function to which tx_timeout points. The latter normally resets the card and restarts the interface scheduler with netif_wake_queue.

The proper value for watchdog_timeo depends on the interface. If the driver does not set it, it defaults to 5 seconds. The parameters to take into account when defining the value are:

The likelihood of transmission collisions

This is zero for point-to-point links, but can be high on shared and overloaded Ethernet links plugged into hubs.

The interface speed

The slower the interface, the bigger the timeout should be.

The value of watchdog_timeo is usually defined as a multiple of the variable HZ, which represents 1 second. HZ is a global variable whose value depends on the platform (it is defined in the architecture-dependent file include/asm-XXX/param.h). As you can see in Table 11-1, even devices of the same type may take different values for the timeout. The table lists only a few examples; it is not a complete list.

Table 11-1. Transmission timeout used by the most common network cards

Device driver            watchdog_timeo (timeout used)
3c501                    HZ
3c505                    10*HZ
3c509                    (400*HZ)/1000
3c515                    (400*HZ)/1000
3c523                    HZ
3c527                    5*HZ
3c59x                    5*HZ
dl2k                     4*HZ
natsemi                  2*HZ
Aironet 4500             8*HZ
s2io (10Gbit)            5*HZ
8390                     (20*HZ)/100
8139too                  6*HZ
b44                      5*HZ
tg3                      5*HZ
e100                     2*HZ
e1000                    5*HZ
SIS 900                  4*HZ
Tulip family             4*HZ
Intel EtherExpress 16    2*HZ
SLIP                     20*HZ

The watchdog timer mechanism is provided by the Traffic Control code. However, advanced device drivers may implement their own watchdog timers, too. See drivers/net/e1000_main.c for an example.




[*] The other flags in the list are described in Chapters 8 and 10.

[*] For consistency, netif_tx_schedule would probably have been a better name.

[*] net_tx_action schedules a device for transmission when it cannot grab the dev->queue_lock lock on the device's egress queue and therefore cannot transmit.

[*] The NETDEV_TX_XXX values were introduced relatively recently, in kernel version 2.6.9. Before their introduction, hard_start_xmit functions used to just return 0 in case of success and 1 in case of error (e.g., if there was no room in the NIC's memory). So far, only a few drivers have been updated to use the NETDEV_TX_XXX values (mainly those that support NETIF_F_LLTX); all the others still use the values 0 and 1 directly.

[*] Some devices can use only 16-bit addresses, which constrains the portion of addressable memory.

[*] The algorithm used by each protocol to compute the checksum is analyzed in the associated chapters.

Chapter 12. General and Reference Material About Interrupts

This chapter contains several general types of information that apply to the material presented in the previous three chapters on interrupts and frame handling.

Statistics

Statistics about frame reception are kept in the per-CPU array netdev_rx_stat, whose elements are of type netif_rx_stats (see include/linux/netdevice.h):

struct netif_rx_stats netdev_rx_stat[NR_CPUS];
 
struct netif_rx_stats
{
        unsigned total;
        unsigned dropped;
        unsigned time_squeeze;
        unsigned throttled;
        unsigned fastroute_hit;
        unsigned fastroute_success;
        unsigned fastroute_defer;
        unsigned fastroute_deferred_out;
        unsigned fastroute_latency_reduction;
        unsigned cpu_collision;
} ____cacheline_aligned;
 

The elements of netif_rx_stats are:

total

Total number of ingress frames processed, including any that might be discarded. This value is updated both in netif_rx and in netif_receive_skb, which means that (by mistake) the same frame is accounted for twice when the driver does not use NAPI (i.e., it uses the netif_rx interface; see Figure 10-2 in Chapter 10).

dropped

Number of frames that were dropped because they were received when the CPU was in the throttle state.

time_squeeze

Number of times net_rx_action had to return while frames were still in the CPU ingress queue, so as not to become a CPU hog. See the section "Processing the NET_RX_SOFTIRQ: net_rx_action" in Chapter 10.

throttled

Number of times the CPU went into the throttle state. This value is incremented by netif_rx.

fastroute_hit

fastroute_success

fastroute_defer

fastroute_latency_reduction

fastroute_deferred_out

Fields that used to be used by the Fastroute feature. This feature has been dropped in kernel 2.6.8.

cpu_collision

Number of times the CPU failed to grab the lock on a device driver (more precisely, on dev->xmit_lock) because the lock was already taken by another CPU. This counter is updated in qdisc_restart, which handles only frame transmission, not reception. cpu_collision is the only statistic about transmission that has been included in this structure.

The fact that some of the preceding counters are currently updated only by netif_rx (which is used only by non-NAPI drivers) means that their values are not correct when using NAPI drivers.

The contents of the netdev_rx_stat vector can be viewed via the /proc interface. See the next section.

Other statistics are kept by the driver in private data structures (see Chapter 2), by higher-layer protocols, and by the Traffic Control queuing disciplines. Some of those values can be read with user-space applications such as ifconfig, tc, ip, or netstat, and others are also exported via /proc.

Tuning via /proc and sysfs Filesystems

All of the files in /proc/sys/net/core listed in Table 12-1 are defined in net/core/sysctl_net_core.c, where you can find the association between files and kernel variables.

Table 12-1. /proc/sys/net/core/ files usable for tuning frame reception

Filename              Kernel variable[a]    Default value
netdev_max_backlog    netdev_max_backlog    300
mod_cong              mod_cong              290
lo_cong               lo_cong               100
no_cong               no_cong               20
no_cong_thresh        no_cong_thresh        10
dev_weight            weight_p              64

[a] All of these variables are defined in net/core/dev.c.

I would like to stress that NAPI drivers do not need any of the fields in Table 12-1. NAPI drivers are expected to initialize net_device->weight using local (to the driver) values rather than weight_p. But they could use weight_p if they wanted, particularly because they usually use the same default value of 64. Starting with kernel 2.6.12, the value of net_device's weight field can be tuned at runtime with sysfs via the per-device file /sys/class/net/device_name/weight. The weight file is created in net/core/net-sysfs.c.

The statistics collected with the netif_rx_stats structures described in the section "Statistics" can be read via the file /proc/net/softnet_stat (the output is in hexadecimal).

Functions and Variables Featured in This Part of the Book

Table 12-2 summarizes the main functions, variables, and data structures introduced or referenced in the previous three chapters. Additional ones can be found in Table 9-1 in Chapter 9.

Table 12-2. Functions, variables, and data structures related to interrupts and frame handling

Functions

netif_rx

Queues an input frame into a CPU's queue. See the section "Old Versus New Driver Interfaces" in Chapter 10.

netif_rx_schedule __netif_rx_schedule

Schedules the NET_RX_SOFTIRQ software interrupt for execution. See the section "Old Versus New Driver Interfaces" in Chapter 10.

netif_rx_complete

Called by the net_device->poll virtual function when the latter has cleared the queue.

netif_start_queue netif_stop_queue

Enables and disables transmission on a device, respectively. See the section "Enabling and Disabling Transmissions" in Chapter 11.

netif_queue_stopped

Checks whether a device is enabled for transmission.

netif_schedule [*] netif_wake_queue

netif_schedule schedules a device for transmission. netif_wake_queue enables transmission on a device and schedules the device for transmission. See the section "Scheduling a Device for Transmission" in Chapter 11.

qdisc_run

Dequeues a frame from the egress queue of a device and pushes it down to the device driver for transmission. See the section "Queuing Discipline Interface" in Chapter 11.

process_backlog

poll virtual function used by non-NAPI device drivers. See the section "Backlog Processing: The process_backlog Poll Virtual Function" in Chapter 10.

netif_receive_skb

Processes input frames by passing them to higher-layer protocol handlers. See the section "Ingress Frame Processing" in Chapter 10.

dev_queue_xmit

Main function for frame transmission. See the section "dev_queue_xmit Function" in Chapter 11.

dev_kfree_skb dev_kfree_skb_irq dev_kfree_skb_any

Releases an sk_buff structure. See the section "Processing the NET_TX_SOFTIRQ: net_tx_action" in Chapter 11.

do_IRQ

Takes care of a hardware interrupt notification by invoking the associated handler.

open_softirq raise_softirq raise_softirq_irqoff

Registers and schedules for execution a software interrupt, respectively. See the section "Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq" in Chapter 9.

do_softirq invoke_softirq

Takes care of the pending software interrupts by invoking the associated handlers. See the section "Pending softirq Handling" in Chapter 9.

net_rx_action net_tx_action

The handlers for the NET_RX_SOFTIRQ and NET_TX_SOFTIRQ software interrupts, respectively. See the section "How the Networking Code Uses softirqs" in Chapter 9.

tasklet_init

Initializes a tasklet_struct structure.

tasklet_action tasklet_hi_action

Handlers for the TASKLET_SOFTIRQ and HI_SOFTIRQ software interrupts, respectively. See the section "Tasklets" in Chapter 9.

tasklet_enable tasklet_hi_enable tasklet_disable tasklet_disable_nosync

Enables and disables a tasklet, respectively. See the section "Tasklets" in Chapter 9.

tasklet_schedule tasklet_hi_schedule

Schedules a tasklet for execution. See the section "Tasklets" in Chapter 9.

Variables

mod_cong lo_cong no_cong no_cong_thresh

Congestion levels for the input queue (used with non-NAPI devices). See the section "Fields of softnet_data" in Chapter 9.

netdev_max_backlog

Maximum size of the CPU's input queues. See Figure 9-4 in Chapter 9.

Data structures

softnet_data

The two NET_XXX_SOFTIRQ software interrupts use one such structure for each CPU. See the section "softnet_data Structure" in Chapter 9.

tasklet_struct

Represents a tasklet. See the section "Tasklets" in Chapter 9.

[*] For consistency with the reception function names, it should probably have been called netif_tx_schedule.

Files and Directories Featured in This Part of the Book

Figure 12-1 shows the files and directories we have referenced in the first four chapters of Part III. The xxx keyword in the figure represents an architecture (e.g., i386).[*] Some architectures do not require particular architecture-specific files, because a general-purpose file can sometimes be used by multiple architectures.

Figure 12-1. Files and directories featured in this part of the book




[*] The irq.c file may not always be inside a directory called kernel.

Chapter 13. Protocol Handlers

Protocols are the framework for all communication: they indicate to each correspondent how to understand the other side of a conversation. In Linux, communication is understood through a protocol handler at each networking layer. This chapter explains how these handlers are installed, chosen at runtime, and invoked.

To understand the relationship among communication layers and protocols, imagine a possible situation in real life where I have to talk to a stranger. What language should I use? If I'm in Italy I'll begin with Italian, and if I'm in the United States I'll try English. If these don't work, there may be ways to negotiate the use of a different language.

On top of that basic protocol, there are others. When writing a letter, for instance, my relationship with the correspondent determines whether I begin "Dear Madam" or "Hi, gal!" These sorts of choices take place at many layers of real-life communication. Networks have layers too, and the choice of protocols becomes formalized in network code.

Overview of Network Stack

Readers of this book are expected to be familiar with the basic TCP/IP protocols, but there are some other protocols in common use—such as Logical Link Control (LLC) and Subnetwork Access Protocol (SNAP)—that you may not know. This section introduces key protocols and shows their relationships.

The two best-known models for network protocols are the seven-layer OSI model and the five-layer TCP/IP model, shown in Figure 13-1. The OSI model remains an important reference point for networking discussions even though it never took off for a variety of reasons. The TCP/IP model covers most of the protocols used by computers today.[*]

Figure 13-1. OSI and TCP/IP models

At each layer, numerous protocols are available. At the lowest level, where interfaces exchange data, the protocol in use is predetermined. A driver for that protocol is associated with the interface, and all data that comes in on the interface is assumed to follow the protocol (i.e., Ethernet); if it doesn't, errors are reported and no communication takes place.

But once the driver has to hand over data to a higher layer, a choice of protocols ensues. Should data at L3 be handled by IPv4, IPv6, IPX (the Novell NetWare protocol), DECnet, or some other network-layer protocol? And a similar choice must be made going from L3 to L4, where TCP, UDP, ICMP, and other protocols reside.

This chapter deals with the lower three layers and briefly touches on the fourth one.

An individual package of transmitted data is commonly called a frame on the link layer, L2; a packet on the network layer; a segment on the transport layer; and a message on the application layer.

The layers are often called the network stack, because communication travels down the layers until it is physically transmitted across the wire (or wireless bands) and then travels back up. Headers are also added and removed in a LIFO manner.

The Big Picture

Figure 13-2 builds on the TCP/IP model in Figure 13-1. Figure 13-2 shows which chapter covers each interface between adjacent layers. Some of these interfaces involve communication down the stack, whereas others involve communication upward:

Going up in the stack (for receiving a message)

This chapter describes how ingress traffic is handed to the right protocol handler. (The meaning of ptype_base and ptype_all will become clear in the section "Protocol Handler Organization.")

Chapter 10 describes how device drivers notify the kernel about the reception of ingress frames.

Chapter 24 describes how the IPv4 protocol delivers ingress IPv4 packets to the right L4 protocol (IPv4 is the only network layer protocol we cover in the book). The IPv4 receive routine is described in Chapter 19.

Going down in the stack (for sending a message)

Chapter 21 describes the functions provided by the IPv4 layer for transmission.

Part VI describes how the neighboring layer interfaces the L3 protocols to the transmitting routine dev_queue_xmit. The latter is described in Chapter 11.

As shown in Figure 13-2, the socket interface is not covered in this book. However, there is one point worth mentioning about the AF_PACKET socket type. It's the Linux way to capture frames at the link layer and inject frames into the link layer, directly bypassing all the intermediate protocol layers. Network sniffers such as tcpdump and Ethereal are common users of AF_PACKET sockets. You can see from the figure that AF_PACKET sockets hand frames directly to dev_queue_xmit, and receive ingress frames directly from the network protocol dispatcher routine (this latter point is addressed in Chapter 10).

Figure 13-2 shows only two protocol families (PF_INET, PF_PACKET), but several others are implemented in the Linux kernel. Among them are:

PF_NETLINK

Used as the preferred interface for network configuration. See Chapter 3.

PF_KEY

Used as a key management interface for network security services. IPsec is one of these services.

PF_LLC

See the section "Logical Link Control (LLC)."

Link Layer Choices for Ethernet (LLC and SNAP)

Although the link layer protocol is fairly fixed by the hardware in use, the Ethernet standard allows some choice between protocols. The first attempt at standardizing this choice was called Logical Link Control (LLC). Since it offered very limited options, it never saw much use. The IEEE 802 committee then standardized the Subnetwork Access Protocol (SNAP) , which is found in use fairly often. The implementation of both of these subprotocols is described later in this chapter.

In LLC, the header contains a field specifying the protocol for the Source Service Access Point (SSAP) and the protocol for the Destination Service Access Point (DSAP) . Each field, however, contains only 8 bits, one of which is dedicated to a flag that indicates whether multicast is in use and another dedicated to a flag that indicates whether the address is local to one network or is recognized worldwide. Therefore, with 6 bits left to specify a protocol, LLC supports a maximum of 64 protocols, which is too few to make the technology popular.

Figure 13-2. The big picture

Therefore, the IEEE 802 committee extended LLC by providing a special value in the SSAP and DSAP fields that indicates that the protocol in use by that source or destination is identified by another 5 bytes in the header. With this extension, called SNAP, there are 40 bits that can be assigned to various protocols.

How the Network Stack Operates

Let's briefly examine a sample communication to see how choices are made at communication points.

In Figure 13-3, assume that a user at Host X wants to download an HTML page using a web browser from the web server running on Server Y. Some of the questions to answer include the following:

Figure 13-3. Example of communication between two remote stations (Host X and Server Y)

  • Because Host X and Server Y are on different local area networks, how will they be able to talk to each other?

  • Because Host X does not know where Server Y is physically located, how will it find out where to send its request?

  • If Server Y is running more than one application (not just the web server), how can its operating system determine which application should handle the request from Host X?

  • If Host X is running more than one application (not just the browser), how can its operating system determine which application receives the data that returns?

Let's follow the request for a web page through the network stack to see how these questions are answered. We'll use Figures 13-3 [*] and 13-4 as references.

Application layer, Host X

The browser reads the URL requested by the user; suppose it is http://www.oreilly.com. The browser uses the Domain Name System (a topic beyond the scope of this book) to resolve the domain www.oreilly.com to an IP address, which we'll suppose is 208.201.239.37. It is up to the IP protocol (L3, the network layer) to find a path between Host X and Server Y using this address.

The browser now initiates an HTTP session on the application layer to 208.201.239.37. It then invokes TCP to carry the traffic to the remote web server. (TCP is used instead of UDP because HTTP requires a reliable channel that can deliver large amounts of data without corrupting it.) The request is now traveling down the network stack.

Transport layer, Host X

The TCP layer breaks the HTTP message request into segments, if needed, and adds a TCP header to each. Among other things, TCP adds the source and destination port. The port number lets the operating system direct the request to the proper application. The web server on Server Y listens on the default HTTP port 80 unless it is explicitly configured to use a different port number, and picks up all traffic there. Server Y directs responses back to Host X's port 5000, which is the source port number the server got from the request received from the host.

Port numbers are an L4 concept, so a separate set of ports exist for TCP and UDP.

The TCP layer on Host X knows the destination port is 80 because the browser uses the default port assigned to the HTTP protocol unless a different one is provided in the URL. The source port assigned to the browser (which will be used to identify the target application when processing ingress traffic) is assigned by the OS (unless a specific one is asked by the application). Let's assume that port was 5000. Different ports can be used for the two sides of the conversation. Network Address Translation (NAT) and proxying firewalls complicate the issue even further, but the outlines of how applications reach each other should be clear from this discussion.

The TCP layer does not know how to get the segments to the destination system. To accomplish that, the TCP layer invokes the IP layer, passing the destination IP address in each transmission request.

Network layer, Host X

The IP layer does not care about applications or ports. All it does is examine the IP addresses on the packets and the network options related to IP. Its big task is to consult routing tables (a complex process discussed in detail in Part VII) to discover that the packet should go through Router RT1. The IPv4 protocol is described in detail in Part V.

The packet is going to drop down another layer to be sent to the router, but the IP layer has to find the right address on this layer for the router. Since L2 involves communication between neighboring hosts (such as hosts sharing a LAN or a point-to-point link), the process used by the IP layer to find the L2 address associated with a given IP address is called a neighbor protocol. It is discussed in Part VI.

Link layer, Host X and Router RT1

This layer is implemented partly by a device driver. On LANs, Ethernet is the most common protocol, but ATM, Token Ring, FDDI, and others exist. Long-distance links use dedicated copper or fiber lines; the simplest of these is the dial-up connection that millions of home and small-office users still establish with their ISPs. LANs use their own (L2) addressing schemes that have nothing to do with TCP/IP; on Ethernet (and in IEEE 802 networks in general), addresses are 6 octets long and are commonly called MAC addresses. On a dedicated line (e.g., dial-up), no L2 addressing is needed at all because each side simply sends to the other side.

Different types of headers might be used on different links, because each is hardware-specific. These headers do not carry any information that is meaningful for the browser and server at the application layer.

Routers RT1, RT2, etc.

Each router in the path, except for the last, goes through the following process to forward the packet to its final destination:

  • It removes the link layer header.

  • It can see that the L3 protocol is IP thanks to a specific field in the link layer header, discussed later in this chapter.

  • It determines that the local system is not the destination of the packet because the destination IP address included in the IP header is not one of its own IP addresses.

  • It forwards the IP packet to the next router on the path toward Server Y. To do this, it consults its routing tables to select the next hop router and creates a new link layer header (i.e., Figure 13-4(E)). The last step is described in detail in Chapter 35.

Normally, the information on L3 (the IP header) does not change as the packet goes from system to system.[*] Different L2 headers are used on each link.

When the packet finally arrives at Router RT3, the latter realizes that Server Y is directly connected and that there is no need to route the packet another hop.

Once the message reaches the destination server, it traverses the network stack again from the bottom upward:

Link layer, Server Y

Stripping off the L2 header, this layer checks a field to see which protocol handles the L3 layer. Finding that L3 is handled by IP, the link layer invokes the appropriate function to continue handling the L3 packet (i.e., L2 payload). Most of this chapter discusses the manner in which protocols register themselves and handle the key field indicating which protocol to use.

Network layer, Server Y

This layer recognizes that its own system's IP address, 208.201.239.37, is the destination address in the packet and therefore that the packet should be handled locally. The network layer strips off the L3 header and once again checks a field to see what protocol handles L4. Chapter 24 offers an in-depth description of the interface between L3 and L4 for ingress traffic.

Figure 13-4 shows how a header is added by each network layer as each one takes the data from a higher layer. The last step, from Figure 13-4(d) to Figure 13-4(e), shows the difference between the original frame transmitted to Router RT1 by Host X and the one between Router RT1 and Router RT2.

Figure 13-4. Headers compiled by layers: (a...d) on Host X as we travel down the stack; (e) on Router RT1

As we have seen, each layer provides a variety of protocols. Each protocol is handled by a different set of kernel functions. Thus, as the packet travels back up the stack, each protocol must figure out which protocol is being used by the next-higher layer, and invoke the proper kernel function to handle the packet.

On the lowest software layer, L2, the hardware in use defines the protocol. If the frame is received on an Ethernet interface, the receiver knows it contains an Ethernet header, a Token Ring interface knows it contains a Token Ring header, and so on. There is no ambiguity unless LLC or SNAP is specified. LLC and SNAP are discussed later in this chapter.

But as the packet travels up the network stack, each protocol needs a field in its header to tell it which protocol should handle the next stage of processing. The progress is shown in Figure 13-5. Thus, the transition from L2 in Figure 13-5(a) to L3 in Figure 13-5(b) depends on L2 checking an "Above protocol" field in the L2 header. Similarly, the L3 layer checks a field in its header to facilitate the transition to L4, shown in Figure 13-5(b) and Figure 13-5(c). Finally, L4 uses the Destination Port field of the packet to take the packet out of the kernel and find the process, such as a web server, that handles the packet on the local host.
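As a rough illustration of this chain of demultiplexing decisions (not the kernel's actual code), each per-layer lookup can be modeled as a small table mapping the "protocol above" identifier to a handler. The numeric values below are the real Ethertype for IPv4/ARP and the IP protocol numbers for TCP/UDP, and the handler names match kernel functions mentioned in this chapter, but the lookup functions themselves are invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: each layer reads a field of its own header to select
 * the handler for the layer above. */
static const char *l3_handler_name(uint16_t ethertype)
{
    switch (ethertype) {
    case 0x0800: return "ip_rcv";   /* L2 header says: payload is IPv4 */
    case 0x0806: return "arp_rcv";  /* ... or ARP                      */
    default:     return NULL;       /* no registered handler: dropped  */
    }
}

static const char *l4_handler_name(uint8_t ip_protocol)
{
    switch (ip_protocol) {
    case 6:  return "tcp_v4_rcv";   /* IP header says: payload is TCP  */
    case 17: return "udp_rcv";      /* ... or UDP                      */
    default: return NULL;
    }
}
```

Following the two lookups in sequence mirrors the L2-to-L3 and L3-to-L4 transitions of Figure 13-5; the final step, from L4 to the application, uses the destination port instead.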

Executing the Right Protocol Handler

For each network protocol, regardless of its layer, there is one initialization function. This includes L3 protocols such as IPv4 and IPv6, link layer protocols like ARP, and so on. For a protocol included statically in the kernel, the initialization function executes at boot time; for a protocol compiled as a module, the initialization function executes when the module is loaded. The function allocates internal data structures, notifies other subsystems about the protocol's existence, registers files in /proc, and so on. A key task is to register a handler in the kernel that handles the traffic for a protocol.

In this section, for the sake of simplicity, I'll show how a device driver (which operates on L2) invokes an L3 protocol, but the same principle applies to any protocol on any layer.

When the device driver receives a frame, it stores it into an sk_buff buffer data structure and it initializes the protocol field shown here:

struct sk_buff
{
        ... ... ...
        unsigned short  protocol;
        ... ... ...
};

The value in this field can be an arbitrary value used by the kernel to identify a given protocol, or a field of a MAC header in the incoming frame. The field is consulted by the kernel function netif_receive_skb (described in Chapter 10) to determine which function handler to execute to process the packet at L3. See Figure 13-6.

Figure 13-5. Frame decapsulation, layer by layer, at Server Y

Figure 13-6. netif_receive_skb processes according to the protocol field of the sk_buff buffer

Most of the values used by the kernel to refer to the protocols in the protocol field are listed in include/linux/if_ether.h with the name ETH_P_XXX. Despite the ETH prefix, not all names refer to Ethernet hardware. As Table 13-1 shows, they can cover a wide range of activities. Table 13-1 lists the values used internally by the kernel, which are assigned directly to skb->protocol by device drivers instead of being extracted from a frame header. (The ones omitted from the table are not assigned a function handler.) The first row of the table, for instance, indicates that the kernel handler ipx_rcv is used to process an incoming packet whose skb->protocol field is ETH_P_802_3.

Table 13-1. Internal protocols

Symbol             Value    Function handler
ETH_P_802_3        0x0001   ipx_rcv
ETH_P_AX25         0x0002   ax25_kiss_rcv
ETH_P_ALL          0x0003   Not a real protocol; used as a wildcard for handlers, such as packet sniffers, that listen to all protocols
ETH_P_802_2        0x0004   llc_rcv
ETH_P_TR_802_2     0x0011   llc_rcv
ETH_P_WAN_PPP      0x0007   sppp_rcv
ETH_P_LOCALTALK    0x0009   ltalk_rcv
ETH_P_PPPTALK      0x0010   atalk_rcv
ETH_P_IRDA         0x0017   irlap_driver_rcv
ETH_P_ECONET       0x0018   econet_rcv
ETH_P_HDLC         0x0019   hdlc_rcv

Not all the ETH_P_XXX values are assigned a handler. They can be left unassigned in two circumstances:

Unfortunately, it is not always sufficient to extract a field from the L2 header to figure out which handler to invoke; the association between skb->protocol and the protocol handler that will process the frame is not always one-to-one. There are cases where the protocol handler for a given ETH_P_XXX will actually just read other parameters from the frame header (without processing the frame) and hand the frame to another protocol handler that will process it. An example is the ETH_P_802_2 handler.

As described in Chapter 10, netif_receive_skb is the function that dispatches ingress frames to the right protocol handlers. When there is no handler for a specific protocol, the frame is dropped.

In special cases, a single packet can be delivered to multiple handlers. This is the case, for instance, when packet sniffers are running. This mode of operation, sometimes referred to as promiscuous mode, is listed as ETH_P_ALL in Table 13-1. This type of handler is generally not used to process packets for recipients, but just to snoop on a given device or set of devices for the purposes of debugging or collecting statistics.

Special Media Encapsulation

Ethernet is by far the most common mechanism used for implementing both shared and point-to-point network connections. In this book, we always refer to Ethernet device drivers when talking about L2. However, Linux allows you to use any of the most common media available on modern PCs to carry IP traffic (and sometimes any network protocol traffic). Examples of media that can be used to transport IP include the serial and parallel ports (SLIP/PLIP/PPP), FireWire (eth1394), USB, Bluetooth, Infrared Data Association (IrDA), etc.

Such media define network devices as abstractions on top of the generic ports, usually by means of extensions to the generic media device driver. Such virtual devices look like real NICs to the upper layers.

Here is how reception and transmission are implemented on these virtual network devices:

Transmission

The net_device's hard_start_xmit function pointer of the virtual device is initialized by the device driver to a routine that will encapsulate the IP packet (let's assume it was an IP packet) according to the protocol used by the media.

Reception

When the generic driver receives data from one of its ports, it strips the media headers (as an Ethernet device driver would strip the Ethernet header), initializes skb->protocol, and notifies the upper layer with a call to netif_rx. When these media are used for point-to-point connections only, there is no need for a link layer header, so skb->protocol is statically initialized to ETH_P_IP; in the other cases, the media encapsulation may include a fake Ethernet header too, so skb->protocol is initialized with eth_type_trans routines (as real Ethernet drivers are).

How exactly the generic device driver of a given media type interfaces to the virtual network device is an implementation detail. Depending on the medium, it may offer a synchronous or asynchronous interface, use of buffering both on receive and transmit paths, etc.

Protocol Handler Organization

Figure 13-7 shows how the different protocol handlers are organized in the kernel. Each protocol is described by a packet_type data structure.

Figure 13-7. Data structure used to store the registered protocol handlers

To make access faster, a very simple hash function is used for most of the protocols. Sixteen lists are organized into an array to which the global variable ptype_base points. When a protocol is registered, using the dev_add_pack function, described in the next section, this function runs a hash function over the protocol type and assigns the packet_type structure to one of the 16 lists. Later on, to find a packet_type structure, the kernel can simply rerun the hash and go through the matching list.
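The bucket selection is easy to reproduce. The following user-space fragment is an illustrative stand-in, not the kernel source; it applies the same mask the kernel uses, hash = ntohs(pt->type) & 15, assuming the type value is already in host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* ptype_base has 16 buckets; a protocol type is hashed to a bucket
 * simply by keeping the low 4 bits of its host-order value, exactly
 * what "& 15" does in dev_add_pack. */
#define PTYPE_BUCKETS 16

static unsigned int ptype_bucket(uint16_t host_order_type)
{
    return host_order_type & (PTYPE_BUCKETS - 1);
}
```

For example, ETH_P_IP (0x0800) lands in bucket 0, ETH_P_ARP (0x0806) in bucket 6, and ETH_P_IPV6 (0x86DD) in bucket 13.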

The ETH_P_ALL protocols (see Table 13-1) are organized in their own list to which the global variable ptype_all points.[*] The number of protocols in this list is stored in netdev_nit. The latter is used by dev_queue_xmit and qdisc_restart to check whether a PF_PACKET socket is open (i.e., a listening sniffer) to which it can deliver a copy of ingress frames (see Chapter 10).

Protocol Handler Registration

At system startup and other times when a protocol is registered, the kernel calls dev_add_pack, passing it a data structure of type packet_type, which is defined in include/linux/netdevice.h as follows:

struct packet_type
{
    unsigned short        type;
    struct net_device       *dev;
    int            (*func) (struct sk_buff *, struct net_device *,
                    struct packet_type *);
    void            *af_packet_priv;
    struct list_head    *list;
};

The fields have the following meanings:

type

The protocol code. It can take any of the values listed in the first column of Table 13-1 through 13-4 (i.e., ETH_P_IP). The difference between the protocols belonging to different tables will become clear in the following sections.

dev

Pointer to the device (i.e., eth0) for which the protocol is to be enabled. A setting of NULL means "all devices." Thanks to this parameter, it would be possible to have different handlers for different devices, or associate a handler with one specific device. This is not normally done, but could be useful for testing. PF_PACKET sockets commonly use it to listen only on a specific device. For instance, a command such as tcpdump -i eth0 creates a packet_type instance via a PF_PACKET socket and initializes dev to the net_device instance associated with eth0.

func

The function handler called by netif_receive_skb (see Chapter 10) when it needs to process one frame with skb->protocol=type (an example is ip_rcv). Note that one of func's input parameters is a pointer to a packet_type structure: it is used by PF_PACKET sockets to access the af_packet_priv field.

af_packet_priv

Used by PF_PACKET sockets. It is a pointer to the sock data structure associated with the creator of the packet_type structure. It is used to allow the dev_queue_xmit_nit routine (seen in Chapter 10) not to deliver a buffer to the sender as well, and by the PF_PACKET receive routine to deliver ingress data to the right socket.

list

Used to link the data structure to the other instances that collide on the same bucket's list. See Figure 13-7.

When you have multiple instances of packet_type associated with the same type protocol, ingress frames that match type are handed to all protocol handler instances by invoking func for all of them. See Chapter 10 for more details.
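A minimal model of that delivery loop can be written in plain C. The types and names below are invented for the sketch (the real logic lives in netif_receive_skb); it walks a list of registered packet_type-like entries and invokes every handler whose type matches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: several handlers may be registered for the same protocol
 * type; a matching ingress frame is handed to all of them. */
struct fake_packet_type {
    uint16_t type;
    void   (*func)(int *seen);        /* toy handler: counts calls */
    struct fake_packet_type *next;
};

static void toy_handler(int *seen) { (*seen)++; }

/* Deliver one frame of the given type; return how many handlers ran. */
static int deliver_to_all(struct fake_packet_type *head, uint16_t type)
{
    int seen = 0;

    for (; head != NULL; head = head->next)
        if (head->type == type)
            head->func(&seen);
    return seen;
}
```

With two entries registered for the same type, both toy handlers run for one matching frame; a frame of an unregistered type reaches no handler (and, in the kernel, would be dropped).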

To register each protocol, the kernel initializes the packet_type structure and then calls dev_add_pack. Here is an example from net/ipv4/ip_output.c that shows how the IPv4 protocol handler is registered by the IPv4 core code.

When the IPv4 protocol is initialized at boot time, the ip_init function is executed. As one result, the function ip_rcv in the IPv4 packet_type structure is registered as the protocol's function handler. All the Ethernet frames received with a "Protocol Above" value of ETH_P_IP will then be processed by the function ip_rcv.

static struct packet_type ip_packet_type =
{
    .type = __constant_htons(ETH_P_IP),
    .func = ip_rcv,
};
...
void __init ip_init(void)
{
    dev_add_pack(&ip_packet_type);
    ...
}

dev_add_pack is quite simple: it checks whether the handler to add is a protocol sniffer (pt->type==htons(ETH_P_ALL)). If so, the function adds it to the list pointed to by ptype_all and increments the number of protocol sniffers registered (netdev_nit++). If the handler is not a sniffer, it is inserted into one of the 16 lists pointed to by ptype_base depending on the value of the hash code. The data structures pointed to by ptype_base and ptype_all are protected by the ptype_lock spin lock.

void dev_add_pack(struct packet_type *pt)
{
    int hash;
 
    spin_lock_bh(&ptype_lock);
    if (pt->type == htons(ETH_P_ALL)) {
        netdev_nit++;
        list_add_rcu(&pt->list, &ptype_all);
    } else {
        hash = ntohs(pt->type) & 15;
        list_add_rcu(&pt->list, &ptype_base[hash]);
    }
    spin_unlock_bh(&ptype_lock);
}

The function dev_remove_pack, as the name suggests, is complementary to dev_add_pack.

void dev_remove_pack(struct packet_type *pt)
{
    __dev_remove_pack(pt);

    synchronize_net();
}

__dev_remove_pack removes the packet_type structure from ptype_all or ptype_base, and synchronize_net is used to make sure that by the time dev_remove_pack returns, no one is holding a reference to the removed packet_type instance (see, for example, the use of RCU locking in netif_receive_skb in Chapter 10).

If dev_add_pack was called within the function init_module, which is in charge of module initialization, dev_remove_pack is most likely within cleanup_module, which is called by the kernel when the module is to be removed. (You can find an example in net/ax25/af_ax25.c.) On the other hand, if the protocol was statically included in the kernel, it would be registered automatically at boot time and removed only when the system shuts down. The IPv4 protocol is an example of a protocol that is never removed at runtime.

Ethernet Versus IEEE 802.3 Frames

A number of protocols go under the loose term Ethernet. The 802.2 and 802.3 standards are represented by the protocols ETH_P_802_2 and ETH_P_802_3, respectively, but there are many other Ethernet protocols, listed in Table 13-2, as well as the LLC and SNAP extensions. The standards institute a couple of hacks to support all of these variations (h_proto is discussed in the following section).

Table 13-2. Valid Ethernet types (when h_proto > 1536)

Protocol          Ethernet type   Function handler
ETH_P_IP          0x0800          ip_rcv, ic_bootp_recv[a]
ETH_P_X25         0x0805          X25_lap_receive_frame
ETH_P_ARP         0x0806          arp_rcv
ETH_P_BPQ         0x08FF          bpq_rcv
ETH_P_DNA_RT      0x6003          dn_route_rcv
ETH_P_RARP        0x8035          ic_rarp_recv
ETH_P_8021Q       0x8100          vlan_skb_rcv
ETH_P_IPX         0x8137          ipx_rcv
ETH_P_IPV6        0x86DD          ipv6_rcv
ETH_P_PPP_DISC    0x8863          pppoe_disc_rcv
ETH_P_PPP_SES     0x8864          pppoe_rcv

[a] The reason why IP has two handlers has to do with the possibility for the kernel to retrieve the IP configuration by means of protocols like RARP/BOOTP. The ic_bootp_recv handler is used only at boot time to take care of the dynamic IP configuration, and it is uninstalled once the configuration has been retrieved. See net/ipv4/ipconfig.c.

Ethernet was designed before the IEEE created its 802.2 and 802.3 standards. The latter are not pure Ethernet, even though they are commonly called Ethernet standards. Fortunately, the IEEE 802 committee decided to make the protocols compatible. Every Ethernet card is able to receive both the 802 standard frame types and the old Ethernet frames, and the kernel provides a routine (discussed later in this section) that allows device drivers to recognize them thanks to the solution described in this section.

This is the definition of an Ethernet header:

struct ethhdr
{
        unsigned char   h_dest[ETH_ALEN];     /* destination eth addr  */
        unsigned char   h_source[ETH_ALEN];   /* source ether addr     */
        unsigned short  h_proto;              /* packet type ID field  */
} __attribute__((packed));

As you will see in the next two sections on LLC and SNAP, other fields can follow the ethhdr structure. Here we are focusing on the protocol field, h_proto. Despite its name, it actually can store either the protocol in use or the length of the frame. This is because it is 2 octets (bytes) in size, but the maximum size of an Ethernet frame is 1,500 bytes. (Actually, the size can reach 1,518 if SA, DA, Checksum, and Preamble are included. Frames using 802.1q have four extra bytes of encapsulation and can therefore reach a size of 1,522 bytes.)

To save space, the IEEE decided to use values greater than 1,536 to represent the Ethernet protocol. Some preexisting protocols with identifiers lower than 1,536 (0x600 hexadecimal) were updated to meet the criteria. The 802.2 and 802.3 protocols, however, use the field to store the length of the frame.[*] Values ranging from 1,501 to 1,535 are not legal in this field.
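The resulting rule is simple enough to capture in code. The classifier below is an illustrative sketch (the enum and function are invented for this example); it applies the boundary just described: values up to 1,500 are 802.3 lengths, values of 1,536 (0x0600) and above are protocol identifiers, and the range in between is illegal.

```c
#include <assert.h>
#include <stdint.h>

/* What does the 2-octet field after the source MAC address mean? */
enum h_proto_kind {
    H_PROTO_LENGTH,     /* 802.2/802.3: frame length (<= 1500)     */
    H_PROTO_ETHERTYPE,  /* Ethernet: protocol ID (>= 1536, 0x0600) */
    H_PROTO_INVALID     /* 1501-1535: not legal in this field      */
};

static enum h_proto_kind classify_h_proto(uint16_t h_proto)
{
    if (h_proto >= 1536)
        return H_PROTO_ETHERTYPE;  /* e.g., 0x0800 for IPv4 */
    if (h_proto <= 1500)
        return H_PROTO_LENGTH;
    return H_PROTO_INVALID;
}
```

This is the same test eth_type_trans performs, as described next, to decide whether the frame is old-style Ethernet or an 802.3 frame.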

Figure 13-8 shows the variations possible on an Ethernet header. Simple Ethernet is shown in (a). The 802.2 and 802.3 variant is shown in (b). As you can see, a single field serves as the protocol field in the former and the length field in the latter. In addition, the 802 variant can support LLC, as shown in (c) and SNAP, as shown in (d).

Figure 13-8. Differences between Ethernet and 802.3 frames

Linux deals with the odd distinction between protocol and length in the eth_type_trans function. A typical context is represented by the following code fragment, issued by the drivers/net/3c509.c Ethernet driver when it receives a frame. netif_rx is the function that copies the frame into the input queue and sets the NET_RX_SOFTIRQ flag to let the kernel know about the new frame in the queue (this is described in Chapter 10). Just before invoking netif_rx, the caller performs some important initializations with a call to eth_type_trans.

el3_rx(struct device *dev)
{
        ... ... ...
        skb->protocol = eth_type_trans(skb,dev);
        netif_rx(skb);
        ... ... ...
}

eth_type_trans performs two main tasks: setting the packet type[*] and setting the protocol. It does the latter in its return value. Let's dispose of the former task before concentrating on the main issue in this section, the protocol.

Setting the Packet Type

The eth_type_trans function sets skb->pkt_type to one of the PACKET_ XXX values listed in include/linux/if_packet.h:

PACKET_BROADCAST

The frame was sent to the link layer broadcast address (i.e., FF:FF:FF:FF:FF:FF for Ethernet)

PACKET_MULTICAST

The frame was sent to a link layer multicast address. Details appear later in this section.

PACKET_OTHERHOST

The frame was not addressed to the receiving interface. However, the frame is not dropped right away but is passed to the next-highest layer. As described earlier, there could be protocol sniffers or other meddlesome protocols that would like to give the frame a look.

When eth_type_trans does not set skb->pkt_type explicitly, its value ends up being 0, which is PACKET_HOST. This means the receiving interface is the recipient of the frame (from a link layer point of view, that is to say, the MAC address matched).

Most of the information needed to set the correct packet type is specified explicitly in the header. An Ethernet address is 48 bits or 6 bytes long. The two most significant bits of the first byte (in network byte order) have a special meaning (see Figure 13-9):

  • Bit 0 distinguishes multicast addresses from unicast addresses. Broadcast addresses are a special case of multicast. When set to 1, this bit denotes multicast; when 0, it denotes unicast. After checking the bit through if(*eth->h_dest&1), the function goes on to see whether the frame is a broadcast frame by comparing the address to the device's broadcast address through memcmp(eth->h_dest,dev->broadcast, ETH_ALEN).

    Figure 13-9. Unicast/multicast and local/global bits in the MAC address

  • Bit 1 distinguishes local addresses from global addresses. Global addresses are worldwide unique, local addresses are not: it is up to the system administrator to assign local addresses properly.[*] When set to 1, this bit denotes a global address; when 0, it denotes a local address.

Thus, the first part of eth_type_trans is:

unsigned short eth_type_trans(struct sk_buff *skb, struct net_device *dev)
{
    struct ethhdr *eth;
    unsigned char *rawp;
 
    skb->mac.raw=skb->data;
    skb_pull(skb,ETH_HLEN);
    eth= eth_hdr(skb);
    skb->input_dev = dev;
 
    if(*eth->h_dest&1)
    {
        if(memcmp(eth->h_dest,dev->broadcast, ETH_ALEN)==0)
            skb->pkt_type=PACKET_BROADCAST;
        else
            skb->pkt_type=PACKET_MULTICAST;
    }
 
    else if(1 /*dev->flags&IFF_PROMISC*/)
    {
        if(memcmp(eth->h_dest,dev->dev_addr, ETH_ALEN))
            skb->pkt_type=PACKET_OTHERHOST;
    }

The IFF_PROMISC flag is set in dev->flags when the interface is put into promiscuous mode . As shown in the previous snapshot, eth_type_trans initializes skb->pkt_type to PACKET_OTHERHOST when the destination MAC address does not match the receiving interface's address, regardless of the IFF_PROMISC flag. This will allow PF_SOCKETS handlers to receive a copy of the frame (see netif_receive_skb in Chapter 10), but the upper-layer protocol handlers must discard buffers of PACKET_OTHERHOST type (see, for example, arp_rcv and ip_rcv).

Setting the Ethernet Protocol and Length

The second part of eth_type_trans retrieves the identifier of the protocol used at the higher layer. Protocol values are also called Ethertypes, and the list of valid types is kept up-to-date at http://standards.ieee.org/regauth/ethertype. The distinction between old Ethernet protocols above the value of 1,536 and 802 protocols is made in the following code fragment:

    if (ntohs(eth->h_proto) >= 1536)
        return eth->h_proto;
 
    rawp = skb->data;
 
    if (*(unsigned short *)rawp == 0xFFFF)
        return htons(ETH_P_802_3);
 
    /*
     *    Real 802.2 LLC
     */
    return htons(ETH_P_802_2);
}

If values bigger than 1,536 are interpreted as protocol IDs, how does a device driver find the size of the frames it receives? In both cases, whether protocol/length values are less than 1,500 or greater than 1,536, it is the device itself that stores the size of the frame into one of its registers, where the device driver can read it. Devices can figure out the size of each frame thanks to well-known bit patterns used for that purpose. The following piece of code from vortex_rx in drivers/net/3c59x.c shows how the driver first reads the size from the device and then allocates a buffer accordingly:

            /* The packet length: up to 4.5K!. */
            int pkt_len = rx_status & 0x1fff;
            struct sk_buff *skb;
 
            skb = dev_alloc_skb(pkt_len + 5);

Do not get confused by the comment in the previous code. This particular device can receive frames up to 4.5 K in size because it handles FDDI NICs, too.

We saw in Chapter 1 what host and network byte order are. The value returned by eth_type_trans, and therefore the value assigned to skb->protocol, is in network byte order: when it is extracted from the Ethernet header it is already in network byte order, and when eth_type_trans uses a local symbol ETH_P_ XXX it needs to explicitly convert it from host byte order to network byte order with the htons macro. This also means that when the kernel accesses skb->protocol later and compares it against an ETH_P_ XXX value, it has to convert either ETH_P_ XXX to network byte order or skb->protocol to host byte order: it does not matter what order is used, it just matters that both sides of the comparison are expressed in the same order. In other words, these two lines are equivalent:

ntohs(skb->protocol) == ETH_P_802_2
skb->protocol == htons(ETH_P_802_2)

Since eth_type_trans is called only for Ethernet frames, there are similar functions for other media types, some with names ending in _type_trans and some with other names. The following example, for instance, shows a bit of code taken from the IBM Token Ring driver (drivers/net/tokenring/ibmtr.c), before the familiar invocation of netif_rx, skb->protocol is set by tr_type_trans, just as eth_type_trans did for Ethernet devices:

static void tr_rx(struct device *dev)
{
    ...
    skb->protocol=tr_type_trans(skb, dev);
    ...
    netif_rx(skb);
    ...
}

If you look at tr_type_trans in net/802/tr.c, you will see logic similar to that of eth_type_trans, but applied to Token Ring devices.

There are also media types that set skb->protocol directly without any helper function of the _type_trans variety, since they can carry only one protocol (i.e., IrDA, AX25, etc.).

Logical Link Control (LLC)

The LLC layer was designed by the IEEE 802 committee when they standardized LANs. The idea was that instead of having a single higher-layer protocol identifier, it would be more flexible to specify one protocol identifier for the source (SSAP) and another for the destination (DSAP). In most cases, SSAP and DSAP would be the same for any given connection—in fact, SSAP and DSAP are always the same when the global flag is set—but having two separate values gives systems the flexibility to use different protocols.

LLC can provide its upper layer different service types:

Type I

Connectionless (i.e., datagram protocol), with no support for acknowledgments, flow control, and error recovery

Type II

Connection oriented, with support for acknowledgments, flow control, and error recovery

Type III

Connectionless, but with some of the benefits of type II

Figure 13-8(c) shows the header format of a frame using LLC. As you can see, there are three new fields:

SSAP

DSAP

These are 8-bit fields for specifying the protocols used.

Control (CTL)

The size of this field depends on the type of LLC used (type I or type II). I will not go into details on the LLC layer, but will assume this field to be 1 byte long and have the value 0x03 (type I, CTL=UI). This is sufficient for understanding the rest of the chapter.

The LLC header did not prove popular for several reasons. Perhaps the main reason is the 8-bit limit on the SSAP and DSAP identifiers, compounded by reserving two of these bits for the unicast/multicast and local/global flags.[*] Only 64 protocols could be specified in the remaining 6 bits, which was too limiting.

When using local SAPs (indicated by the local/global flag in the protocol field), the network administrator must make sure all the systems agree on the local SAPs they use, which makes things complicated and less usable. Ambiguity is not possible for global SAP, but global SAP is not being used for new protocols. In the next section, you will see how this limitation was solved by extending the header with the concept of SNAP.

Table 13-3 shows the SAPs registered with the Linux kernel. LLC causes the kernel to use an extra level of indirection when retrieving the handler, compared to the protocols listed in Table 13-2 and registered with dev_add_pack.

Table 13-3. The kernel's 802.2 SAP clients

Protocol    SAP     Function handler
SNAP        0xAA    snap_rcv
IPX         0xE0    ipx_rcv

The IPX case

You may wonder whether a pure 802.3 frame format can be used, given that there is no indication of a protocol ID in Figure 13-8(b). In fact, pure 802.3 frames are not normally used. The one well-known exception involves IPX. IPX packets can be sent using raw 802.3 frames (that is, frames without an LLC header). The receiver recognizes them by means of a hack. The first field of an IPX header is a 16-bit checksum, which normally is turned off by simply setting it to 0xFFFF. Since 0xFF/0xFF[] is an invalid SSAP/DSAP combination and there is no Ethertype with that value, IPX packets using raw 802.3 can be easily recognized. When they are detected, skb->protocol is set to ETH_P_802_3, whose handler is the IPX handler (see Table 13-1).

Linux's LLC implementation

The 802.2 LLC layer was expanded and rewritten during the 2.5 development cycle. The kernel's LLC implementation, which supports types I and II, consists of the following main components:

  • Two state machines. These are used to keep track of the states of the local SAPs and the connections created on top of them.

  • An LLC receive routine that feeds the right input to the two state machines based on the input frames it receives.

  • The AF_LLC socket interface. This can be used to build protocols or services in user space on top of the LLC layer.

Because none of the protocols described in this book uses the LLC layer, I will not go into detail on the definitions of the LLC services (you can refer to the IEEE 802.2 Logical Link Control specification for this[*]), nor will I look at the details of the Linux kernel's LLC implementation. Here we will only see what data structure is used to define a local SAP and briefly how ingress frames are handled.

The data structure used to define a local SAP is llc_sap, which is defined in include/net/llc.h. Among its fields are:

struct llc_addr laddr

SAP identifier.

int (*rcv_func)(struct sk_buff *, struct net_device *, struct packet_type *)

Function handler. When an SAP is opened via PF_LLC socket, this field is NULL. When the SAP is opened by the kernel, this field points to the routine provided by the kernel (see Table 13-3).

Local SAPs are created with llc_sap_open, and are inserted into the llc_sap_list list. llc_sap_open is called to create two types of SAP:

  • Those installed by the kernel itself to install kernel-level handlers[] (see Table 13-3).

  • Those managed with PF_LLC sockets (for example, when a server uses the bind system call on a PF_LLC socket to bind it to a given SAP).

Processing ingress LLC frames

Whenever an incoming frame is classified by eth_type_trans as using the LLC header (because it has a type/length field that is less than 1,536 and no special IPX case is detected), the initialization of skb->protocol to ETH_P_802_2 leads to the selection of the llc_rcv handler (see Table 13-1). This handler will select the right protocol handler based on the DSAP field in the LLC header: to do so, it calls the rcv_func handler registered with llc_sap_open for those SAPs opened by the kernel, and feeds the right input to the right state machine when the SAPs were opened with a PF_LLC socket (see Figure 13-10).

Figure 13-10. The llc_rcv function

Frames are sent out a given SAP when one of the two state machines requires it (for example, to acknowledge the reception of a frame). PF_LLC sockets can use the standard interface (i.e., sendmsg, etc.) to transmit. In both cases, frames are fed directly to dev_queue_xmit once the appropriate link layer headers have been initialized properly.

Subnetwork Access Protocol (SNAP)

Given the limitations of the LLC header, the 802 committee generalized the data link header further. To make the protocol domain bigger, they introduced the concept of SNAP. Basically, when the SSAP/DSAP couple is assigned the value 0xAA/0xAA, it has a special meaning: the five bytes following the CTL field of the LLC header represent a protocol identifier. The unicast/multicast and local/global bits are also not used anymore. Thus, the size of the protocol identifier has jumped from 8 bits to 40. The reason the committee decided to use five bytes has to do with how protocol numbers are derived from MAC addresses.[*] Unlike SSAP/DSAP, the use of SNAP codes is pretty common.

Since the SNAP identifier 0xAA/0xAA is a special case of SSAP/DSAP, as shown in Table 13-3, it is one of the clients that use llc_sap_open (see snap_init in net/802/psnap.c). This means that a protocol using a SNAP code will have another level of indirection, which means three of them!

Before looking at how SNAP clients register with the kernel, let's briefly see how a SNAP protocol ID is defined. As you probably know, MAC addresses are managed by the IEEE, which sells them in chunks of 2^24. Since a MAC address is 48 bits long (6 bytes), the IEEE simply has to give each client a 24-bit number (the first three bytes of a MAC address) and let the client use any value for the remaining 24 bits. Suppose I want to buy a chunk of MAC addresses because I want to start selling network cards. We'll call the number assigned to me XX:YY:ZZ. At that point, I would become the owner of all the addresses between XX:YY:ZZ:00:00:00 and XX:YY:ZZ:FF:FF:FF. Together with those 2^24 MAC addresses, I would be assigned all the 2^16 SNAP codes between XX:YY:ZZ:00:00 and XX:YY:ZZ:FF:FF.

Effectively, when you get a 24-bit number from the IEEE, it offers you four 24-bit numbers thanks to the four possible combinations of the global/local and unicast/multicast bits (see Figure 13-9).

Similar to the way SAP protocols are registered and unregistered, the SNAP layer provides the register_snap_client and unregister_snap_client functions, which also use a global list (snap_list) to link together all the SNAP protocols registered with the kernel. Table 13-4 shows the clients registered with the Linux kernel.

Table 13-4. SNAP client

Protocol                                   Snap ID          Function handler
AppleTalk Address Resolution Protocol      00:00:00:80:F3   aarp_rcv
AppleTalk Datagram Delivery Protocol       08:00:07:80:9B   atalk_rcv
IPX                                        00:00:00:81:37   ipx_rcv

The data structure used to define a SNAP protocol is datalink_proto, defined in include/net/datalink.h. Among its fields, you have:

unsigned short header_length

This is the length of the data link header. It is initialized to 8 in register_snap_client (see Figure 13-8(d)).

unsigned char type[8]

Protocol identifier. Only five bytes are used (the SNAP protocol ID; see Table 13-4).

void (*request)(struct datalink_proto *, struct sk_buff *, unsigned char *)

Initialized to snap_request in register_snap_client. It initializes the SNAP header (protocol ID only) and passes the frame to the 802.2 code. It is invoked before a transmission to fill in the data link header.

void (*rcvfunc)(struct sk_buff *, struct net_device *, struct packet_type *)

Function handler for ingress traffic. See Table 13-4.

I'll focus for just a moment on IPX. It's worth pointing out that this protocol registers the same handler with the kernel at three different points:

Figure 13-11 summarizes how the kernel recognizes and handles Ethernet, 802.3, 802.2, and SNAP frames.

Figure 13-11. Protocol detection for Ethernet/802.3/802.2/SNAP frames

Tuning via /proc Filesystem

For both Ethernet and 802 there is a directory in /proc/sys/net: /proc/sys/net/ethernet/ (which is empty) and /proc/sys/net/token-ring/ (which includes a single file), registered respectively in the files net/core/sysctl_net_ethernet.c and net/802/sysctl_net_802.c. These two directories are included only when the kernel is compiled with support for Ethernet and Token Ring, respectively.

Functions and Variables Featured in This Chapter

Table 13-5 lists the main functions introduced in this chapter, together with the most important global variables and data structures.

Table 13-5. Functions and data involved in protocol handler management

Functions

dev_add_pack, dev_remove_pack
    Add/remove a protocol handler.

register_8022_client, unregister_8022_client
    Register/unregister an 802.2 protocol. They are defined as wrappers around llc_sap_open and llc_sap_close.

register_snap_client, unregister_snap_client
    Register/unregister a SNAP client.

llc_sap_open, llc_sap_close
    Create/remove an SAP.

eth_type_trans
    Used by Ethernet devices to extract the higher-layer protocol identifier and classify the frame as unicast/multicast/broadcast.

Variables

netdev_nit
    Number of protocol sniffers registered.

ptype_base
    Pointer to the data structure containing the registered protocol handlers.

ptype_all
    Same as ptype_base but applied to protocol sniffers.

snap_list
    List of SNAP clients.

Data structure types

struct packet_type
    Used to store information about an ETH_P_XXX protocol handler.

struct datalink_proto
    Used to represent a SNAP protocol.

Files and Directories Featured in This Chapter

Figure 13-12 shows the location of the files mentioned in this chapter. In include/linux you can find if_xxx.h header files for other media types. The net/llc directory includes several more files.

Figure 13-12. Files and directories featured in this chapter




[*] For more information on these two models, I suggest Computer Networks, Second Edition (Prentice Hall).

[*] The figure shows only the details needed for our discussion.

[*] We will see in Part V that only changing fields, such as Time To Live (TTL) and checksum, need to be updated.

[*] The figure shows that the ETH_P_ALL protocol types use the packet_rcv routine. func is initialized by packet_create in net/packet/af_packet.c based on the kernel configuration.

[*] The reason for this arrangement is a long story. For the curious, I suggest reading Interconnections, Second Edition: Bridges, Routers, Switches, and Internetworking Protocols (Addison Wesley), where the author explains it with considerable irony.

[] netif_rx is only one of the two interfaces available to device drivers to notify upper layers about the reception of frames. Both of them are described in Chapter 10.

[*] Even though the code calls it the packet type, it actually is the frame type because it is derived from the link layer address.

[*] There is no relationship between local MAC addresses and nonroutable IP addresses (192.168.x.x, etc.): they are similar in concept, but applied to two different layers in the stack.

[*] The meaning of those two flags is the same as discussed earlier for MAC addresses, but here it applies to protocols rather than addresses.

[] The check against 0xFF/0xFF to recognize IPX packets is used all over the place in the Linux kernel. eth_type_trans is one example.

[*] Like most IEEE documents, the one about the LLC design is pretty big and not fun to read. However, with this document in your hands, it will be much easier to go through the LLC code, especially through the boring details of the state machines.

[] This can be accomplished indirectly via the register_8022_client routine, too.

[*] SNAP codes are defined as a subset of MAC addresses, which are sold by IEEE in chunks. This way, each MAC address owner has a number of SNAP codes assigned to her together with the MAC addresses. For details, I recommend reading Interconnections, Second Edition (Addison Wesley).

Part IV. Bridging

At the L3 layer, protocols such as IPv4 connect different networks through the routing subsystem laid out in Part VII. In this part of the book, we will look at the link layer or L2 counterpart of routing: bridging. In particular:

Chapter 14 Bridging: Concepts

Introduces the concepts of transparent learning and selective forwarding.

Chapter 15 Bridging: The Spanning Tree Protocol

Shows how the Spanning Tree Protocol (STP) solves most of bridging's limitations, and concludes with an overview of the latest STP enhancements (not yet available for Linux).

Chapter 16 Bridging: Linux Implementation

Shows how Linux implements bridging and STP.

Chapter 17 Bridging: Miscellaneous Topics

Concludes with an overview of how the bridging code interacts with other networking subsystems and a detailed description of the data structures used by the bridging code.

Chapter 14. Bridging: Concepts

In this first chapter on bridging, we will see what a bridge device is, how it is used, and what limitations it comes with. In particular, I'll describe transparent bridging, address learning, and the use of the so-called forwarding database. I'll conclude the chapter with an explanation of why bridges cannot be used on loop topologies and I will introduce the next chapter, where we will see how the Spanning Tree Protocol (STP) can address this limitation. Other forms of bridging are available, but they are rarely used and not implemented in the Linux kernel.

The network topologies used in this chapter do not necessarily represent real case scenarios; they are selected based only on didactic principles.

Repeaters, Bridges, and Routers

Before introducing bridging, I will clarify the distinction between different network devices that forward packets: repeaters, bridges, and routers. The differences are illustrated in Figure 14-1:

  • A repeater is a device, typically equipped with two ports, that simply copies what it receives on one port to the other, and vice versa. It copies data bit by bit; it does not have any knowledge of protocols, and therefore cannot distinguish among different frames or packets. Repeaters are rarely used nowadays, because bridges have become pretty affordable and provide better capabilities that justify the cost difference. Multiport repeaters are called hubs.

  • Unlike a repeater, a bridge understands link layer protocols and therefore copies data frame by frame, instead of bit by bit. This means that a bridge must be able to buffer at least one frame per port. Most LANs are implemented with bridges (that more commonly are called switches; see the section "Bridges Versus Switches"). This device is the main protagonist of this chapter.

  • A router is a device that understands L3 network protocols such as IP, and forwards ingress packets based on a routing table. The term gateway, which was introduced before router, is also commonly used to refer to the same kind of device. Part VII of this book goes into detail on how Linux implements routing.

Figure 14-1. (a) Repeater; (b) bridge; (c) router

Figure 14-1(b) shows what is called a store-and-forward bridge, which is the scheme used by Linux: Ethernet frames are copied out of the right ports only after they have been received in their entirety.

Other schemes are possible. For example, a pretty common one called cut-through starts copying frames to the right ports as soon as it has received enough of the ingress frame to identify the destination ports. The meaning of right ports will become clearer at the end of this chapter, when we will have seen what address learning is. This scheme is faster because it starts copying earlier, but it cannot discard corrupted ingress frames.[*] The scheme also requires some cooperation from the network interface card (NIC) hardware. In the current model, NICs pass whole frames to the device drivers.
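The trade-off between the two schemes comes down to how many bytes of the frame must sit in the bridge's buffer before forwarding can start. The following Python sketch makes the difference concrete; the function and scheme names are mine, not the kernel's:

```python
# How many bytes of an ingress Ethernet frame must be received before a
# bridge may start copying it out. A cut-through bridge only needs the
# destination MAC address, which occupies the first 6 bytes of the header;
# a store-and-forward bridge (the Linux model) needs the whole frame.

DST_MAC_LEN = 6  # bytes

def forwarding_start_byte(frame_len, scheme):
    if scheme == "store-and-forward":
        return frame_len
    if scheme == "cut-through":
        return DST_MAC_LEN
    raise ValueError("unknown scheme: %s" % scheme)
```

For a maximum-size 1,518-byte frame, cut-through can start forwarding after 6 bytes instead of 1,518, which is where its latency advantage comes from; the price, as noted above, is that the frame's checksum has not been verified when forwarding begins.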

A bridge assigns a link layer address to each of its interfaces, and forwards anything that passes through it but is not addressed to it. (Routers act similarly at the L3 level, as we'll see in Part VII.) But would any frame be addressed to the bridge's interface? After all, the whole point of a bridge is to help frames get to other destinations. However, a bridge does consume some ingress frames, under two conditions:

To pass it to the upper (i.e., L3) layer

This is possible only when the bridge implements L3 functionalities, too (i.e., it is a router or host in addition to a bridge) and the ingress frame is addressed to the L2 address configured on the receiving interface.

To pass it to a protocol handler

We will see this case in Chapter 15 when I will introduce the Spanning Tree Protocol.

Bridges Versus Switches

The terms bridge and switch can be used to refer to the same device. However, nowadays the term bridge is mainly used in the documentation (such as the IEEE specifications referenced at the end of this chapter) that discusses how a bridge behaves and how the STP (which we will see in the next chapter) works. In contrast, references to the actual devices are usually made with the term switch.

The only cases where I have seen people refer to a bridge device as a bridge are when the device is equipped with only two ports (and bridges with two ports are not that common nowadays). This is why I often define a switch informally as a multiport bridge. Unless you are familiar with the official IEEE documentation, you will probably use the term switch. I personally worked on bridging software for years, and as far as I can remember, I used the term bridge only when working on the documentation, never to refer to a device in any network setup.

Generally speaking, I can say that there is really no difference between a bridge and a switch.

Bridges are pretty common nowadays. You can find bridges with a variable number of ports (and matching prices). An Ethernet bridge, nowadays, represents the most common way to implement a LAN.

With a PC running Linux, you can implement a bridge by installing more than one NIC. You can also find multiport Peripheral Component Interconnect (PCI) NICs on the market that allow you to have more network interfaces than PCI slots. You can, for example, have four Ethernet ports in a single PCI NIC.

In the rest of this chapter and in the next ones, I will stick to the term bridge, but now you know that the switch that you are using in your office to connect the PCs and the network printer is nothing but a multiport bridge, which probably runs more software than a pure bridge would run—for example, to provide additional features.

Hosts

Any device that operates at a network layer higher than the one used by bridges (i.e., the link layer or L2) is considered a host in the context of this and the following bridging chapters (routers included). A host (i.e., a Linux system) can, if configured appropriately, be both a bridge and a host that a user can use as a standard workstation. But in this chapter and the following ones, we will not consider this case; we'll assume a host does not run any bridging code. Therefore, the PCs in all the figures do not run any bridging code unless stated otherwise in the text.

Merging LANs with Bridges

Let's take the scenario in Figure 14-2 as an example and see how a bridge can be used to merge two LANs and make them look like one. Let's assume the hosts in the two LANs were configured to be part of the same IP subnet. We do not need to include the IP addresses in the figure because we will focus on what happens at the link layer.

You should note that a switch is nothing but a multiport bridge.

Any frame transmitted on a LAN by any host is received by all other hosts. So when Host A sends out a frame, both the other hosts of LAN1 and the bridge receive it. A bridge copies its ingress frame out on all the other ports (there is just one other port, in this case). At the end, therefore, all hosts of both LAN1 and LAN2 receive a copy of the frame generated by Host A. This means that thanks to the bridge, there is only one big LAN from the perspective of the hosts of LAN1 and LAN2. Bridges are commonly employed to merge physical LANs whose hosts are configured on the same IP subnet, because they give the hosts the illusion of a single LAN.
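Before learning enters the picture, the bridge's forwarding decision is trivial: copy everywhere except where the frame came from. This can be sketched in a couple of lines of Python (port names are invented for illustration):

```python
# A "blind" bridge: every ingress frame is copied out of all ports except
# the ingress one. This is exactly the behavior that merges LAN1 and LAN2
# in Figure 14-2.

def flood(ports, ingress_port):
    """Return the ports an ingress frame is copied out of."""
    return [port for port in ports if port != ingress_port]
```

With ports ["lan1", "lan2"], a frame arriving from LAN1 is copied only onto LAN2, so every host on both LANs sees every frame, as described above.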

Note that the bridge forwards the ingress frames as they are received. It does not add, remove, or change anything on them: the hosts in LAN2 receive an exact copy of the original frame generated by Host A.

Figure 14-2. Two LANs merged with a bridge

You may argue that a packet from Host A addressed to Host B is needlessly forwarded to LAN2, which is a waste of bandwidth on LAN2 and a waste of CPU power for the hosts of LAN2 (since all of them will end up dropping any frame that is not addressed to any of them). By assigning the hosts of the two LANs to two different IP subnets and replacing the bridge with a router, as in Figure 14-3, the waste is eliminated, because the router does not forward to LAN2 those packets that are addressed to a host configured on LAN1.

Figure 14-3. Two LANs connected with a router

图 14-214-3的拓扑用于不同的环境。第一个希望位于不同 LAN 中的主机共享相同的 L2,因此共享相同的 IP (L3) 子网。第二个更喜欢将主机隔离在不同的子网上,也许是出于管理原因。

The topologies of Figures 14-2 and 14-3 are used in different contexts. The first one prefers to have hosts that are located in different LANs share the same L2 and therefore the same IP (L3) subnet. The second one prefers to segregate hosts on different subnets, perhaps for administrative reasons.

Note that the hosts in Figure 14-2 still need a router to reach IP addresses outside their subnet.

Bridging Different LAN Technologies

In the previous examples, we always saw a bridge with both ports connected to Ethernet LANs. This bridge type is the most commonly used, mainly because the de facto standard for LANs nowadays is Ethernet. However, especially in the past, there used to be bridges with different LAN ports; for example, an Ethernet port and a Token Ring port. Such bridges have one more issue to take into consideration: the differences between the bridged LAN technologies. For example, Ethernet and Token Ring LANs operate at different speeds, and use different L2 protocols and headers. The different speeds require some kind of buffering to be implemented, and the different protocols require the bridge to convert headers from one format to the other, including taking care of those L2 options that are provided by one protocol but not by the other. Linux bridges only between Ethernet ports, so we will not consider the more complex case any further.

Address Learning

We saw in the previous section that a bridge that blindly copies ingress frames to all the ports except the one that received the frame may lead to a waste of resources. Fortunately, bridges are not as blind as that. They actually are able to learn the location of hosts (their L2 addresses, to be more exact) and use that knowledge to selectively copy ingress frames only to the right port. This process is called passive learning, because it is handled by the bridge alone, without any need for user configuration or help from a protocol. Let's see how it works with the help of Figure 14-4. To make the figure more readable, the figure uses the "Host N" notation to refer to L2 addresses (i.e., it does not show real L2 addresses, as Figure 14-2 does).

Let's see what happens when the hosts of Figure 14-4 exchange a few frames. Keep in mind that the addresses discussed here are link layer addresses (i.e., Ethernet MAC addresses):

Figure 14-4(a)

Host A transmits a frame addressed to Host B. Host B receives it, because it sits on the same LAN, and the bridge receives a copy as well. Because the bridge does not know where Host B is located, it copies the frame on LAN2. But because the bridge has received a frame from Host A on the bridge's LAN1 port, it now knows that Host A is located in LAN1. Note that this is possible because Ethernet headers include both the source and destination addresses.

Figure 14-4(b)

Host B transmits a frame addressed to Host A. Both Host A and the bridge receive the frame. Because the bridge knows that Host A is on LAN1, the same LAN it received the frame from, it will not copy the frame on LAN2.

Figure 14-4. Examples of address learning

Figure 14-4(c)

Host A transmits a frame addressed to Host C. Both Host B and the bridge receive the frame. Host B discards it because it is not the recipient, and the bridge copies the frame to LAN2 because it does not know where Host C is located. The bridge already knows that Host A is in LAN1. Therefore, it does not need to add any entry to the list of addresses reachable through its port on LAN1.

Figure 14-4(d)

Host C transmits a frame addressed to Host A. Both Host D and the bridge receive a copy. Host D discards the frame because it is not the recipient, and the bridge copies it to LAN1 because it knows that Host A is located on LAN1.

The act of copying a frame out to all interfaces except the one the frame is received from, which is used by bridges when they do not know which interface to use to reach a given L2 address, is called flooding.
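The four steps of Figure 14-4 can be replayed with a minimal learning-bridge sketch. This is illustrative Python, not the kernel's implementation; host and port names are invented, and a real forwarding database is keyed on Ethernet MAC addresses:

```python
# A minimal transparent bridge with passive address learning, following
# the steps in Figure 14-4.

class LearningBridge:
    def __init__(self, ports):
        self.ports = set(ports)
        self.fdb = {}  # source address -> port it was learned on

    def receive(self, ingress_port, src, dst):
        """Process one ingress frame; return the set of egress ports."""
        self.fdb[src] = ingress_port           # passive learning
        out = self.fdb.get(dst)
        if out is None:                        # unknown destination: flood
            return self.ports - {ingress_port}
        if out == ingress_port:                # same LAN: do not copy at all
            return set()
        return {out}                           # known destination: one port

bridge = LearningBridge(["lan1", "lan2"])
# (a) A -> B: B is unknown, so the frame is flooded onto LAN2; A is learned
step_a = bridge.receive("lan1", "A", "B")
# (b) B -> A: A is known to be on LAN1, so nothing is copied onto LAN2
step_b = bridge.receive("lan1", "B", "A")
```

Steps (c) and (d) follow the same logic: a frame to the still-unknown Host C is flooded, and Host C's reply from LAN2 is copied only onto LAN1, where Host A was learned.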

Broadcast and Multicast Addresses

When a bridge receives a frame addressed to the link layer broadcast address (FF:FF:FF:FF:FF:FF) or to an L2 multicast address, it copies it to every port except the one it received it from. Multicast addresses and the broadcast address cannot be used as source addresses in a frame, so bridges never learn them and never associate them with any specific port (which would be a mistake).
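On Ethernet, whether an address is a group (multicast or broadcast) address is encoded in the least significant bit of the first octet, so the "never learn these as source addresses" rule reduces to a one-bit test. A small sketch (the kernel performs a similar check with its is_multicast_ether_addr() helper):

```python
# Group-address test for Ethernet MAC addresses written as "xx:xx:xx:xx:xx:xx".
# The I/G (individual/group) bit is the least significant bit of the first
# octet; the broadcast address FF:FF:FF:FF:FF:FF is just a special group
# address with all bits set.

def is_group_address(mac):
    """True for multicast and broadcast Ethernet addresses."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x01)
```

A bridge floods any frame whose destination passes this test, and refuses to create a forwarding database entry for any source address that would pass it.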

Aging

A bridge needs to dynamically update the list of addresses reachable through its interfaces; otherwise, it may end up not delivering frames to their recipients or needlessly copying frames to the wrong ports. Let's look at a couple of examples with the help of Figure 14-4:

  • Once Hosts A and B have exchanged some data, the bridge knows that it does not need to copy onto LAN2 any frame exchanged between those two hosts (see Figures 14-4(a) and (b)). If you move Host B to LAN2 for some reason, the bridge's knowledge would be outdated: the bridge will not forward to LAN2 the frames generated by Host A and addressed to Host B. However, as soon as Host B starts talking again, the bridge can learn its new location and update its knowledge.

  • Once Hosts A and C have exchanged some data, the bridge knows that the two hosts are on different LANs. Therefore, it knows the frames generated from Host A and addressed to Host C need to be copied from the generating LAN to the other one, and vice versa (see Figures 14-4(c) and (d)). Supposing that Host C is moved to LAN1, the bridge would keep copying frames from Host A to LAN2 even if they are not needed. As soon as Host C starts talking again, the bridge can update its lists of addresses and move Host C's address from LAN2's list to LAN1's list.

In both cases, if a host does not generate any frames after it has been moved, the bridge does not have any way of learning its new location.

Therefore, to adapt the bridge's knowledge to topology changes, the addresses learned by the bridge are timed out after a configurable amount of time. This aging mechanism is usually implemented by a simple timer that is started when the address is first learned, and restarted (reset) anytime the host is heard again, confirming or updating its address. The process is shown in Figure 14-5. The lower the timer is, the faster a bridge can learn about changes, but also the more frequently the bridge finds itself not knowing where a given host is located and having to use flooding. The default aging time is 5 minutes. We will see in Chapter 15 how the aging time can be lowered by the STP under specific conditions, and in Chapter 17 how the system administrator can change the default aging time.
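The timer-per-address mechanism just described can be sketched as follows. This is a toy model, not kernel code; the class name is invented, and the clock is injectable purely so the behavior is easy to test:

```python
import time

DEFAULT_AGING_TIME = 300.0  # seconds: the 5-minute default mentioned above

class AgingFdb:
    """Toy forwarding database with per-address aging."""

    def __init__(self, aging_time=DEFAULT_AGING_TIME, clock=time.monotonic):
        self.aging_time = aging_time
        self.clock = clock                # injectable for testing
        self.entries = {}                 # address -> (port, last_seen)

    def learn(self, address, port):
        # First learning starts the timer; hearing the host again resets it.
        self.entries[address] = (port, self.clock())

    def lookup(self, address):
        entry = self.entries.get(address)
        if entry is None:
            return None                   # unknown: the bridge must flood
        port, last_seen = entry
        if self.clock() - last_seen > self.aging_time:
            del self.entries[address]     # expired: back to flooding
            return None
        return port
```

Note how an expired entry is simply dropped: the bridge falls back to flooding until the host is heard again and relearned, exactly the behavior described for a moved host.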

Figure 14-5. Address learning and aging

Multiple Bridges

So far, we have seen only simple scenarios with just one bridge. However, because transparent bridges are transparent to each other, as well as to the hosts and the routers, you can create a larger L2 domain (i.e., a bigger LAN) by employing multiple bridges, as shown in Figure 14-6.

The figure also shows the list of addresses that is learned by each interface of the two bridges, assuming each host has spoken at least once and thus given the bridges a chance to learn their locations. Note, for example, that from Bridge 1's perspective, Hosts A, B, C, and D are all located in LAN2, or in other words, are reachable via Bridge 1's interface on LAN2. After all, a bridge does not really care exactly where a host is located; all it needs to know is what port to use to reach it.

The use of multiple bridges, however, requires care to be taken when designing the topology. Let's see why with the example in Figure 14-7.

Figure 14-6. Topology with two bridges

Figure 14-7. Redundant bridges

Multiple bridges on the same LAN can be useful, for instance, to increase the availability of the connectivity between the LANs on which the bridges have interfaces. If one bridge becomes unusable for some reason, the other ones will be able to keep up connectivity. The figure shows a topology with only two bridges, but nothing would forbid you from having more.

Nothing comes for free, though, so as you can imagine, there is a problem with the topology of Figure 14-7. The problem comes from the "transparency" property of bridges that we described earlier as a positive aspect. So next we'll see where our configuration gets us in trouble.

Bridging Loops

Overall, transparency is good because hosts located in different LANs can be transparently merged as if there were only one common LAN. However, transparency is also dangerous because a bridge does not know the origin of an ingress frame. The bridge's job is to learn the location of hosts from ingress packets, build a sort of database of addresses, and copy ingress frames to the right ports based on such a database. When you have more than one bridge sitting on the same LAN, you cannot assume anymore that an ingress frame originated in the same LAN to which the port that received the frame is connected: the frame could have been copied there by another bridge. This lack of information is so dangerous that bridges cannot be used as shown in Figure 14-7 due to the catastrophic consequences of such a setup.

Let's see, for example, what happens in the scenario in Figure 14-8, when Host A transmits a packet and both Bridge 1 and Bridge 2 have empty databases (i.e., no address has been learned yet).

Figure 14-8. Bridging loop

Both bridges will receive the frame, realize that Host A is located in LAN1, and copy the frame on LAN2. Which bridge will do it first is not deterministic; it depends, for instance, on how loaded the two bridges are. Let's suppose they do the copy at almost the same time. The two bridges will therefore receive a copy of the frame on their interfaces on LAN2 and think that Host A has moved to LAN2 (remember that the frame they receive on LAN2 is an exact copy of the original one transmitted by Host A on LAN1). At this point, both bridges will copy the frame to LAN1 (we suppose the destination host has not replied yet and that therefore the bridges do not know where it is located). They will again receive each other's copy, change their minds about the location of Host A, and copy the frames to LAN2.

This is a loop that will flood the two LANs with the same frame circulating endlessly, making any other transmission on the two LANs impossible. The CPUs of the other hosts on the LANs will also be kept busy receiving and dropping the huge number of copies of the same frame and, if not protected at the interface layer by some rate-limiting means, will collapse.

This simple scenario tells us an important rule: transparent bridges cannot be used on loop topologies.
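The runaway behavior can be demonstrated with a deliberately simplified model of Figure 14-8: both bridges blindly re-flood every copy they see from one LAN onto the other (the flip-flopping address learning described above cannot stabilize the situation anyway). Bridge and LAN names are illustrative:

```python
# Simplified model of the loop in Figure 14-8. Host A transmits one frame
# on LAN1; each round, both bridges copy every frame they saw on one LAN
# onto the other LAN. The number of circulating copies grows instead of
# ever reaching zero.

def simulate_loop(rounds):
    copies_on = {"lan1": 1, "lan2": 0}   # Host A's original frame on LAN1
    total_copies = 1
    for _ in range(rounds):
        nxt = {"lan1": 0, "lan2": 0}
        for _bridge in ("bridge1", "bridge2"):   # both bridges flood
            nxt["lan2"] += copies_on["lan1"]
            nxt["lan1"] += copies_on["lan2"]
        copies_on = nxt
        total_copies += nxt["lan1"] + nxt["lan2"]
    return copies_on, total_copies
```

Each round doubles the number of copies in flight; on a real network the frames would quickly saturate both LANs and the hosts' CPUs, as described above.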

Loop-Free Topologies

A simple solution to make the topology of Figure 14-8 work would be to disable Bridge 2 and enable it only when Bridge 1 fails. However, this solution would not give us any real redundancy because it would require some kind of manual intervention. Another solution, which is the one commonly used, is to make bridges visible to each other, while still keeping the learning and copying tasks transparent, as described earlier.

Figure 14-9. Topology with multiple loops

Figure 14-9 shows a more complex scenario.[*] Note that Bridge 5 has an interface on three LANs. It should be clear that all the bridges must cooperate, and you cannot simply turn bridges on or off;[*] you need to be able to define a loop-free topology with a finer granularity. So instead of disabling bridges, you disable bridges' ports. The topology of Figure 14-8 does not represent a common or suggested scenario; however, bridges must be able to work and provide loop-free connectivity even in such a mess.

Let's return to the simple example in Figure 14-8, and find out the feature of the bridging protocol that makes it safe. If you draw a graph where bridges and LANs are nodes, and bridge connections to LANs (i.e., bridge ports) are (bidirectional) links, you get Figure 14-10.

Figure 14-10. Graph associated with the topology of Figure 14-8

All you need to make the graph of Figure 14-10 loop free is to remove a link by disabling a bridge port. The graph for the topology of Figure 14-9 would be more complicated and would include several loops (I can count at least five of them).

Note that there is always more than one way to make any loop topology loop free. For example, to break the loop in Figure 14-10, there are four different choices.
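That count is easy to verify mechanically. Modeling Figure 14-10 in Python, with the four bridge ports as bidirectional links (node names are mine), shows that removing any single link yields a tree, i.e., a connected, loop-free topology:

```python
# The graph of Figure 14-10: each of the two bridges has one port on each
# of the two LANs, for a total of four bidirectional links forming a cycle.

EDGES = [("bridge1", "lan1"), ("bridge1", "lan2"),
         ("bridge2", "lan1"), ("bridge2", "lan2")]

def is_tree(edges):
    """A topology is loop free and connected iff it is a tree."""
    nodes = {n for edge in edges for n in edge}
    adjacency = {n: [] for n in nodes}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen, stack = set(), [next(iter(nodes))]
    while stack:                          # simple reachability walk
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency[node])
    # connected, and exactly |nodes| - 1 links means no cycle
    return len(seen) == len(nodes) and len(edges) == len(nodes) - 1

# Disabling any one of the four ports breaks the loop:
choices = [e for e in EDGES if is_tree([x for x in EDGES if x != e])]
```

All four single-port choices work, which matches the four different options mentioned above; the full graph, with all four links active, fails the test because it contains the loop.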

Defining a Loop-Free Topology

If you are familiar with graph theory, you know that given a graph (with costs in the links), finding the best loop-free topology is a classic problem, elegantly solved by a series of different algorithms. However, all those algorithms are centralized: the algorithm runs once with all the necessary information. In our case, the bridges need a distributed algorithm. The bridges must be able to converge to a loop-free topology, disabling the right ports, by exchanging some kind of information.
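For contrast with the distributed approach that bridges need, here is a sketch of a classic centralized solution: Kruskal's algorithm, run once with complete knowledge of a hypothetical topology and its link costs. STP converges to a comparable loop-free result, but through configuration messages exchanged between the bridges:

```python
# Kruskal's algorithm: pick the cheapest subset of links that connects all
# nodes without creating a loop. The nodes and costs below are hypothetical.

def kruskal(nodes, weighted_edges):
    """Return a minimum spanning tree as a list of (cost, a, b) edges."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n
    tree = []
    for cost, a, b in sorted(weighted_edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # adding this link creates no loop
            parent[root_a] = root_b
            tree.append((cost, a, b))
    return tree

# Hypothetical bridged network with a redundant link and per-link costs:
NODES = ["B1", "B2", "B3"]
LINKS = [(1, "B1", "B2"), (1, "B1", "B3"), (4, "B2", "B3")]
tree = kruskal(NODES, LINKS)
```

The expensive redundant link between B2 and B3 is left out of the tree; it would be re-enabled only if one of the cheaper links failed, which is exactly the kind of adaptation STP performs dynamically.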

The algorithm used by bridges to find the best loop-free topology is the Spanning Tree Protocol defined by the 802.1D-1998 IEEE standard, which was extended with the new Rapid Spanning Tree Protocol (RSTP) and became 802.1D-2004. RSTP is sometimes also referred to as 802.1w.

Another interesting extension is the Multiple Spanning Tree Protocol (MSTP), known also as 802.1s. It was integrated into 802.1Q-1998, which then became 802.1Q-2002 (and should become 802.1Q-2005 sometime this year to reflect the latest changes).

For simplicity, we will refer to those protocols as STP (the original Spanning Tree Protocol), RSTP, and MSTP. Chapter 15 describes STP and gives an overview of the improvements introduced with RSTP and MSTP. Chapter 16 goes into detail on the implementation of STP.




[*] Well, there are variants of cut-through that can handle corrupted frames, too. We do not need to go into that much detail on this point for our generic discussion on bridging.

[*] The figure lacks hosts in LAN2 and LAN3 because I want you to focus on the network topology. Any hosts in any LANs in Figure 14-9 would be affected by the consequences of a loop topology.

[*] Well, you can turn bridges off if you like, but you would reduce the degree of redundancy that you can achieve.

Chapter 15. Bridging: The Spanning Tree Protocol

We saw in Chapter 14 that transparent bridging represents an easy way to merge LANs, but it can be used only on loop-free topologies. This limitation eliminates the use of transparent bridges on networks where redundant links are used to increase overall availability.

In this chapter, we will see how the Spanning Tree Protocol (STP) manages to make any topology loop free, and therefore allows the network administrator to use topologies with redundant links. In particular, we will see:

  • How the distributed algorithm used by STP leads to a loop-free topology by disabling the right redundant links. The loop-free topology selected by STP is a tree (which by definition is loop free). All the traffic between the hosts of the LANs connected by the bridges travels along the links of this tree.

  • How STP dynamically updates the topology to cope with configuration changes and bridge or link failures.

  • How STP dynamically updates the forwarding database (i.e., addresses learned on the bridge ports) when changes in the topology are detected.

We cannot go into detail on the STP for lack of space. The goal of this chapter is to give you an overview detailed enough to make you comfortable with the description of the kernel implementation of STP discussed in Chapter 16. For a complete discussion of the 802.1D STP and its enhancements, please refer to the IEEE specifications.

The examples provided in this chapter do not function as a guide on how to configure STP: most of them are used only to show how specific conditions can be met and how STP handles them.

Basic Terminology

Let's define a few key terms that will be used in this and the following chapter:

LAN

This term should not need any introduction. We will use the term LAN to refer not only to Ethernet-like networks but also to point-to-point connections.

L2 network

A LAN or a set of LANs merged by bridges. As we saw in Chapter 14, the use of bridges allows multiple LANs to be merged and look like a single bigger LAN (to the eyes of routers and hosts—i.e., to the eyes of devices operating at higher network layers).

Bridged (or switched) network

An L2 network implemented with bridges.

Bridge port

Interface

On a bridge device, each network interface is a bridge port. On a more general-purpose system, such as a PC running Linux, a network interface is not necessarily used as a bridge port. In the context of this chapter, the terms bridge port and interface could be used interchangeably, but in Chapter 16, where we will need to distinguish between bridging and nonbridging interfaces, I will use the term interface to refer to nonbridging NICs only. Each port of a bridge can be used to link both hosts and other bridges. We will see several examples in this and the following chapters.

Link

A connection between two devices. In this chapter, I will use the term to refer to the connection between two bridges.

Stable network

An L2 network where the STP has converged to the final loop-free topology.

Figure 15-1 shows the terms as they are used in this chapter and their relationship to other everyday networking terms.

Example of Hierarchical Switched L2 Topology

We know that two hosts can be connected to each other with a cross cable; you do not necessarily have to use a device such as a hub or a bridge. You can do the same between bridges and routers. In the examples in this chapter, you will often see such cross-cable links between bridges.

与第 14 章中的场景（简单的两端口网桥直接连接到所连接 LAN 中的主机）不同，真正的桥接网络通常具有类似于树的拓扑，其中主机仅（或主要）位于叶节点。

Unlike the scenarios in Chapter 14, where simple two-port bridges link directly to hosts located in the connected LANs, a real bridged network normally has a topology that resembles a tree, where hosts are located only (or mainly) at the leaf nodes.

基本术语

图 15-1。基本术语

Figure 15-1. Basic terminology

当你遇到像第 14 章中看到的那样简单的场景时，通常不会启用 STP；事实上，大多数站点可以简单地使用单个扁平 LAN，而根本不用网桥。为了更好地理解 STP，我们需要看看现实生活中的场景是什么样的。让我们采用图 15-2(a) 中的经典分层冗余桥接拓扑，大多数商业网桥供应商都宣传和推广这种拓扑。

When you have simple scenarios like the ones seen in Chapter 14, you do not usually enable the STP; in fact, most sites could simply use a single flat LAN instead of employing bridges at all. To better understand STP, we need to see what a real-life scenario looks like. Let's take the classic hierarchical bridged and redundant topology in Figure 15-2(a), which is advertised and evangelized by most of the commercial bridge vendors.

该图省略了本章稍后描述的详细信息,例如网桥 ID 和优先级、端口成本和优先级值,以便让您专注于拓扑和活动链路选择。在本章的后续图中,我们将重用图底部图例中的符号定义。

The figure leaves out details described later in this chapter, such as the bridge ID and priority, port cost, and priority values, to let you focus on the topology and active links selection. In the subsequent figures in this chapter, we will reuse the symbol definitions in the legend at the bottom of the figure.

树的叶子(图的底部)是主机。主机链接到所谓的访问桥 (通常称为接入交换机):为主机提供网络连接的网桥。接入网桥主要用于转发连接到同一网桥的主机之间的流量,但它们也有一条或多条到上层网桥的链路。图 15-2中的接入桥标记为 A1、A2、A3 和 A4。

At the leaves of the tree (the bottom of the figure) are the hosts. The hosts are linked to so-called access bridges (commonly called access switches): the bridges that give network connectivity to the hosts. Access bridges are mainly used to forward traffic between the hosts linked to the same bridge, but they also have one or more links to the upper-layer bridges. The access bridges in Figure 15-2 are labeled A1, A2, A3, and A4.

由于主机始终位于拓扑的叶节点,因此您可以根据需要拥有任意数量的主机链接。它们不会导致任何循环(当然,我们假设叶子之间没有链接)。因此,到主机的链路不受 STP 影响:无需禁用任何链路即可定义无环路拓扑。毕竟,STP 的最终目标是使网络看起来像一个大型的单一 LAN 并为所有主机提供连接,那么为什么要断开任何主机的连接呢?

Because hosts are always located at the leaves of the topology, you can have as many links to hosts as you like. They will not cause any loops (of course, we assume there are no links between leaves). Because of that, the links to the host are not affected by the STP: none of them needs to be disabled to define a loop-free topology. After all, the ultimate goal of STP is to make the network look like a big single LAN and provide connectivity to all hosts, so why would you disconnect any of the hosts?

分层桥接 L2 拓扑

图 15-2。分层桥接 L2 拓扑

Figure 15-2. Hierarchical bridged L2 topology

分布层的网桥(图中的D1和D2)主要用于桥接位于其直接连接的一些接入网桥中的主机之间的流量。例如,D1 将照顾 A1 和 A2。

A bridge at the distribution layer (D1 and D2 in the figure) is mainly used to bridge traffic between hosts located in some of the access bridges it is directly connected to. For example, D1 will take care of A1 and A2.

请注意,D1 也链接到 A3 和 A4,尽管当前 D1 的链接处于非活动状态(图中的虚线)。如果 D2 和 A3 之间的链路发生故障,STP 将确保更新拓扑,以便 A3 再次成为树的一部分。例如,网络可以启用D1和A3之间的链路;我们将在本章后面看到如何做。

Note that D1 is also linked to A3 and A4, although currently D1's links are inactive (dotted lines in the figure). In case the link between D2 and A3 fails, the STP will make sure that the topology is updated so that A3 is again part of the tree. For example, the network could enable the link between D1 and A3; we will see how later in this chapter.

两个分布网桥 D1 和 D2 还链接到两个核心网桥 C1 和 C2。C1 和 C2 的工作应该很清楚：将 D1 的子树连接到 D2 的子树。（另一种解决方案是使用单个、也许更强大的核心网桥。）分布层和核心层之间也存在冗余链路，因此，例如，如果链路 C1-D1 发生故障，C2-D1 将接管。网桥所在的层越高，其处理的流量就越大（因为子树越大）。

The two distribution bridges D1 and D2 are also linked to the two core bridges C1 and C2. It should be clear what C1 and C2's job is: to connect D1's subtree to D2's subtree. (An alternative solution would be one with a single, and maybe more powerful, core bridge.) Between the distribution and core layers there are also redundant links so that if, for example, the link C1-D1 failed, C2-D1 would take over. The higher the layer where a bridge is located, the bigger the volume of traffic that is processed (because the subtree is bigger).

该图显示了 STP 选择定义无环拓扑的拓扑链路,以及哪些端口已被分配指定角色和根角色。在本章中,我们将了解指定角色和根角色的用途、如何分配它们以及原因。

The figure shows the links of the topology that the STP has selected to define the loop-free topology, and what ports have been assigned the designated and root roles. In this chapter, we will see what the designated and root roles are used for, how they are assigned, and why.

请注意,图 15-2(a)的 L2 网络内的任何一对主机之间交换的流量都 使用 L2 协议(即以太网)进行传输。路由可以在核心实现,也可以通过核心实现。从主机的角度来看,链路层没有层次结构,只有扁平的局域网;整体拓扑如图15-2(b)所示。[ * ]

Note that the traffic exchanged between any pair of hosts within the L2 network of Figure 15-2(a) uses L2 protocols to travel (i.e., Ethernet). Routing can be implemented at the core or through the core. From the host's perspective, there is no hierarchy at the link layer, only a flat LAN; the overall topology appears to it like Figure 15-2(b).[*]

使用多个网桥有几个优点:

The use of multiple bridges has a few advantages:

  • 它有助于隔离流量。例如，当主机 1 与主机 10 通话时，主机 11 可以与主机 20 通话，主机 21 可以与主机 40 通话，且无需接收并丢弃彼此的帧。这样 L2 网络的整体带宽就增加了。但在最坏的情况下，一个帧可能需要穿过整棵树才能到达目的地。例如，图 15-3 显示了需要从主机 40 到主机 1 的帧的路径。请注意，该图还显示了每个网桥端口学习到的地址：例如，网桥接口旁边的符号 1-10 表示该接口已经学习了主机 1 到 10 的 MAC 地址。（我们在第 14 章中看到了地址学习如何进行。）

  • It helps segregate traffic. For example, while Host 1 talks to Host 10, Host 11 can talk to Host 20, and Host 21 can talk to Host 40, all without having to receive and discard each other's frames. So the overall bandwidth of the L2 network is increased. But in the worst case, a frame may need to cross the entire tree to get to its destination. For example, Figure 15-3 shows the path of a frame that needs to go from Host 40 to Host 1. Note that the figure also shows the addresses learned by each bridge port: for example, the notation 1-10 close to a bridge's interface means that the latter has learned the MAC addresses of Hosts 1 through 10. (We saw in Chapter 14 how address learning works.)

  • 大量主机变得更易于管理。您不需要将所有主机连接到单个巨型桥,这意味着主机可以位于不同的区域。布线也更容易维护。

  • Large numbers of hosts become easier to manage. You do not need to connect all the hosts to a single giant bridge, which means the hosts can be located in different areas. Cabling is also simpler to take care of.

我将在此结束对 L2 桥接拓扑的概述。您需要一整本书来详细介绍桥接协议和 STP,因此我将继续概述 STP 实现的算法。

I'll end my overview of L2 bridged topologies here. You would need a whole book to cover bridging protocols and STP in detail, so I'll move ahead with an overview of the algorithm implemented by the STP.

在本章的其余部分,我们将使用更简单的拓扑来描述协议。然而,我们将看到在更大、更复杂的网络中,如图15-2所示,其工作方式是一样的。

In the rest of this chapter, we will use simpler topologies to describe the protocol. However, what we will see works just the same way in bigger and more complex networks like the one in Figure 15-2.

生成树协议的基本要素

Basic Elements of the Spanning Tree Protocol

寻找最佳无根生成树或最佳有根生成树是运筹学中的一个经典问题。文献提供了计算复杂度不同的不同算法。现实生活中有大量的应用程序经常使用这些算法。

The search for the best nonrooted spanning tree or the best rooted spanning tree is a classic problem in operational research. The literature provides different algorithms that differ in computational complexity. There are a huge number of applications in real life where those algorithms are commonly used.

桥接网络示例

图 15-3。桥接网络示例

Figure 15-3. Example of bridged network

我们在本章中描述的 STP 有一个有点相似的目标：给定一个图和一个根节点 R，定义以 R 为根的最佳生成树。然而，有一个重要的区别：该算法不是在单个主机上执行后再将结果分发给所有其他主机；相反，这是一个分布式协议。网络中的所有网桥都必须运行它。通过运行该协议，它们启用一些端口并禁用其他端口，随之而来的整体拓扑就是最佳有根生成树。根节点的选择也是协议的一部分：网桥就谁是根节点达成一致，然后决定启用和禁用哪些链路。

The STP we describe in this chapter has a somewhat similar goal: given a graph and a root node R, define the best spanning tree rooted in R. However, there is one important difference: the algorithm is not executed on a single host that later distributes the result to all the others; instead, this is a distributed protocol. All bridges in the network must run it. By running this protocol, they enable some of their ports and disable others, and the overall topology that follows is the best rooted spanning tree. The selection of the root node is also part of the protocol: the bridges agree on who is the root node and then decide what links to enable and disable.

让我们尝试理解“最佳生成树”的确切含义。给定一个图和一个要作为根的节点，最佳生成树是使每个节点到根节点的距离最小化的无环拓扑（树）。根据图的不同，可能会有不止一棵树具有相同的优度得分[ * ]（图 15-4 显示了一个具有两个同样好的解决方案的示例）。

Let's try to understand what "best spanning tree" means exactly. Given a graph and a node you want to be the root, the best spanning tree is the loop-free topology (tree) that minimizes the distance of each node from the root node. Depending on the graph, there could be more than one tree with the same goodness score[*] (Figure 15-4 shows an example with two equally good solutions).

当未为链路分配成本或为所有链路分配相同的成本(这将是等效的)时,从节点到根的距离将被测量为链路数(即网络跳数)。然而,当您将成本与链路相关联时,跳数并不一定表明路径的良好程度。

When the links are not assigned a cost, or are all assigned the same cost (which would be equivalent), the distance from a node to the root is measured as the number of links (that is, network hops). However, when you associate a cost with the links, the number of hops is not necessarily an indication of the goodness of the path.

没有为链接分配成本的图表

图 15-4。没有为链接分配成本的图表

Figure 15-4. Graph with no costs assigned to the links

例如,如果我们向图 15-4中的拓扑添加成本,则关联的最佳根生成树与图 15-4中的两种解决方案都不同,如图15-5所示。

For example, if we add costs to the topology in Figure 15-4, the associated best rooted spanning tree differs from both of the solutions in Figure 15-4, as shown in Figure 15-5.
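下面的 Python 草图（拓扑、成本与函数名均为假设值）演示了在带权图中计算每个节点到根的最低路径成本——这正是 STP 在选择端口时要最小化的量。

As a minimal sketch (not from the book's kernel sources; the topology, costs, and the `path_costs_to_root` name are this example's own), here is how the distance to the root can be computed once links carry costs. STP effectively minimizes this quantity for each node:

```python
import heapq

def path_costs_to_root(links, root):
    """Compute each node's lowest path cost to the root (Dijkstra).

    links: dict mapping (a, b) -> cost for an undirected link a-b.
    Returns {node: cost}, the metric the STP minimizes when it
    selects root ports and designated ports.
    """
    # Build an adjacency list from the undirected link costs.
    adj = {}
    for (a, b), cost in links.items():
        adj.setdefault(a, []).append((b, cost))
        adj.setdefault(b, []).append((a, cost))

    best = {root: 0}
    heap = [(0, root)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, link_cost in adj[node]:
            new_cost = cost + link_cost
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return best

# Hypothetical weighted topology in the spirit of Figure 15-5:
# a triangle where the two-hop path is cheaper than the direct link.
links = {("R", "A"): 10, ("R", "B"): 1, ("B", "A"): 2}
print(path_costs_to_root(links, "R"))  # {'R': 0, 'A': 3, 'B': 1}
```

Note how A reaches the root at cost 3 through B rather than cost 10 over the direct link: with costs assigned, the number of hops is no longer the measure of the path's goodness.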

我们将在“网桥和端口 ID ”部分中看到网络管理员如何使用其他值覆盖默认成本,例如基于货币成本和连接可靠性等管理参数的选择。

We will see in the section "Bridge and Port IDs" how the network administrator can override default costs with other values, basing the choice, for instance, on administrative parameters such as monetary cost and reliability of the connections.

STP 通过让每个网桥与其邻居交换称为网桥协议数据单元（BPDU）的专用帧来实现其目标。通过 BPDU 交换的信息使网桥能够：

The STP achieves its goal by having each bridge exchange specialized frames called bridge protocol data units(BPDUs) with its neighbors. The information exchanged with BPDUs allows bridges to:

  • 为每个网桥端口分配一个已定义的状态,例如转发或阻止,该状态定义端口上是否可以接受数据流量

  • Assign each bridge port a defined state, such as forwarding or blocking, that defines whether data traffic can be accepted on the port

包含分配给链接的成本的图表

图 15-5。包含分配给链接的成本的图表

Figure 15-5. Graph with costs assigned to the links

  • 通过此端口状态分配,从环路拓扑中选择并丢弃正确的链路,从而实现无环路拓扑

  • Select and discard the right links from the loop topology by means of this port state assignment, leading this way to a loop-free topology

一些网桥端口被分配特殊角色，例如取决于它们是通向树的根（所谓的根网桥）还是通向树的叶节点。

Some of the bridge ports are assigned special roles, depending, for example, on whether they lead toward the root of the tree (the so-called root bridge ) or the leaf nodes of the tree.

考虑到图论中使用的精确术语,STP 也使用定义明确的术语来指代节点(网桥)和链路(网桥端口)也就不足为奇了。在看算法之前,我需要解释一下:

Given the precise terminology used in graph theory, it should not be a surprise that the STP also uses a well-defined terminology to refer to nodes (bridges) and links (bridge ports). Before looking at the algorithm, I need to explain:

  • 什么是根桥和指定桥

  • What a root bridge and designated bridge are

  • 可以为网桥端口分配哪些状态和角色

  • What states and roles can be assigned to a bridge's port

  • 根端口和指定端口的工作

  • The job of root and designated ports

根桥

Root Bridge

根桥不仅仅是拓扑中的占位符；它在算法中起着核心作用。例如，在接下来的部分中，您将看到：

The root bridge is not just a placeholder in the topology; it plays a central role in the algorithm. For example, in the next sections, you will see that:

  • 根桥是唯一生成 BPDU 的桥。其他网桥仅在收到 BPDU 时才传输 BPDU(即,它们通过简单地更新几个字段来修改收到的信息)。

  • The root bridge is the only bridge that generates BPDUs. The other bridges transmit BPDUs only when they receive one (i.e., they revise the information they receive by simply updating a couple of fields).

  • 根网桥确保网络中的每个网桥在发生拓扑更改时都知道(请参阅“拓扑更改”部分)。

  • The root bridge makes sure each bridge in the network comes to know about a topology change when one occurs (see the section "Topology Changes").

请注意，端口状态和角色的选择（以及应启用或禁用的链路）取决于根桥在拓扑中的位置：这是因为首先选择根桥，然后基于它构建最佳树。

Note that the selection of the port states and roles (and therefore of the links that should be enabled or disabled) depends on the location of the root bridge in the topology: this is because first you select the root bridge, and then you build the best tree based on that.

指定网桥

Designated Bridges

虽然每棵树只有一个根桥,但每个 LAN 有一个指定桥,它成为 LAN 上的所有主机和网桥用来到达根的桥。通过确定 LAN 上哪个网桥到根网桥的路径成本最低来选择指定网桥。

While each tree has only one root bridge, there is one designated bridge for each LAN, which becomes the bridge all hosts and bridges on the LAN use to reach the root. The designated bridge is chosen by determining which bridge on the LAN has the lowest path cost to the root bridge.

因此,以图 15-2为例:[ * ]

Thus, using Figure 15-2 as an example:[*]

  • 在A3-D2 LAN中,D2是指定网桥。

  • In the A3-D2 LAN, D2 is the designated bridge.

  • 在D2-C2 LAN中,C2是指定网桥。

  • In the D2-C2 LAN, C2 is the designated bridge.
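指定网桥的选择规则可以用下面的小例子示意（网桥名称与成本均为假设；按网桥 ID 打破平局的细节将在后文“指定端口选择”中介绍）。

A tiny sketch of the selection rule, assuming each candidate bridge on the LAN is summarized as a (root path cost, bridge ID) pair; the bridge-ID tie-break anticipates the criteria covered later in "Designated Port Selection":

```python
def designated_bridge(candidates):
    """Pick the designated bridge for one LAN (sketch).

    candidates: (root_path_cost, bridge_id) per bridge attached to
    the LAN. The lowest root path cost wins; the numerically lower
    bridge ID breaks ties.
    """
    return min(candidates)

# Hypothetical LAN where D2 reaches the root at cost 4 and A3 only
# at cost 19 (by going through D2), as in the Figure 15-2 example:
print(designated_bridge([(19, "A3"), (4, "D2")]))  # (4, 'D2')
```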

生成树端口

Spanning Tree Ports

我们在前面的章节中介绍了根桥和指定桥的角色。让我们看看可以为桥接端口分配哪些状态和角色。

We introduced the root and designated bridge's roles in the previous sections. Let's see here what states and roles can be assigned to a bridge port.

端口状态

Port states

STP 端口是运行 STP 的网桥中的端口。该端口将根据“何时发送配置 BPDU”一节中的规则处理入口 BPDU 并发送 BPDU。

An STP port is a port in a bridge that runs the STP. This port will process ingress BPDUs and transmit BPDUs according to the rules in the section "When to Transmit Configuration BPDUs."

STP端口可以处于以下任一状态:

An STP port can be in any of the following states:

禁用
Disabled

端口被关闭(通过管理操作);它不接收或传输任何流量。

The port is shut down (through administrative action); it does not receive or transmit any traffic.

阻塞
Blocking

端口已打开,但 STP 已阻止它。它不能用于转发任何数据流量。

The port is up, but the STP has blocked it. It cannot be used to forward any data traffic.

侦听
Listening

该端口已启用,但不能用于转发任何数据流量。

The port is enabled, but it cannot be used to forward any data traffic.

学习
Learning

端口已启用,但不能用于转发任何数据流量;然而,网桥的地址学习过程是活跃的。

The port is enabled, but it cannot be used to forward any data traffic; however, the bridge's address learning process is active.

转发
Forwarding

端口已启用,学习处于活动状态,并且可以转发数据流量。

The port is enabled, learning is active, and data traffic can be forwarded.

中间的学习状态允许网桥减少转发数据库为空时原本需要的洪泛量。

The use of the intermediate learning state allows a bridge to reduce the amount of flooding that would otherwise be required with an empty forwarding database.

除处于禁用状态的端口外,无论端口状态如何,都会处理入口 BPDU。处于给定状态的端口是接收入口 BPDU 还是发送 BPDU 取决于端口的角色,这在“端口角色”一节中进行了介绍。

With the exception of ports in the disabled state, ingress BPDUs are processed regardless of the port state. Whether a port in a given state receives ingress BPDUs or transmits BPDUs depends on the port's role, which is introduced in the section "Port roles."

图 15-6 显示了端口状态如何改变。从阻塞经过侦听和学习到最活跃的状态（转发）有一个明显的进展。阻塞和转发之间的转换由协议根据各种因素决定（参见后面的“定义活动拓扑”一节）。注意：

Figure 15-6 shows how the state of a port can change. There is a clear progression from blocking through listening and learning to the most active state, forwarding. The transitions between blocking and forwarding are decided by the protocol based on various factors (see the later section "Defining the Active Topology"). Note that:

  • 正在进入转发状态的端口可以在进入转发状态之前移回阻塞状态。例如,当拓扑尚未稳定并且因此其状态可能在短时间内重复变化时,这是可能的。

  • A port on its way to the forwarding state can be moved back to blocking before the forwarding state is entered. This is possible, for instance, when a topology is not stable yet and therefore its state may change repeatedly in a short amount of time.

  • 从阻塞到转发的中间状态之间的转换由计时器驱动(请参阅“计时器”部分),并且需要避免临时循环的风险(请参阅“避免临时循环”部分)。

  • The transitions between the intermediate states from blocking to forwarding are driven by a timer (see the section "Timers") and are needed to avoid the risk of temporary loops (see the section "Avoiding Temporary Loops").

端口状态转换

图 15-6。端口状态转换

Figure 15-6. Port state transitions
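图 15-6 的状态推进可以用如下草图表达（状态名、`advance` 函数及其标志均为本例假设，并非内核代码）。

The progression in Figure 15-6 can be sketched as a small state machine (the `advance` helper and its flag are this sketch's own invention, not kernel code):

```python
# The timer-driven forward progression of Figure 15-6.
NEXT = {
    "blocking": "listening",   # protocol decides the port should forward
    "listening": "learning",   # after one Forward Delay interval
    "learning": "forwarding",  # after another Forward Delay interval
}

def advance(state, make_forwarding):
    """One transition step for an STP port (sketch).

    A port on its way to forwarding can be pushed back to blocking
    at any point; a disabled port changes only by administrative
    action, which this sketch does not model.
    """
    if state == "disabled":
        return state
    if not make_forwarding:
        return "blocking"
    return NEXT.get(state, state)  # forwarding stays forwarding

# Two Forward Delay intervals separate blocking from forwarding:
s = "blocking"
for _ in range(3):
    s = advance(s, True)
print(s)  # forwarding
```

The intermediate listening and learning steps are what gives neighboring bridges time to react before data traffic starts flowing, which is how temporary loops are avoided.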

此外,管理员可以手动将端口从任何这些状态中删除并禁用它。当端口被管理禁用后,只能通过另一个管理干预来重新启用;STP不具备此能力。然而,网桥可以在协议定义的基本功能之上实现可选功能,以启用和禁用端口,而无需管理干预。

In addition, an administrator can manually remove a port from any of these states and disable it. When a port is administratively disabled, it can be re-enabled only by another administrative intervention; the STP does not have this capability. However, bridges can implement optional features on top of the basic ones defined by the protocol, to enable and disable ports without administrative intervention.

端口角色

Port roles

生成树协议端口可以分配以下两个角色之一:

STP ports can be assigned one of the following two roles:

Root

对于每个桥,除根桥外,选择到根桥的路径成本最低的端口作为根端口。

For each bridge, with the exception of the root bridge, the port with the lowest path cost to the root bridge is selected as the root port.

指定的
Designated

在每个LAN中,选择到根桥的路径开销最小的端口作为指定端口。指定端口所属的网桥称为 LAN 的指定网桥。请注意,端口位于不同 LAN 的网桥可以有多个指定端口,如图15-2所示。用于选择指定端口的标准将在稍后的“指定端口选择”部分中描述。

On each LAN, the port with the smallest path cost to the root bridge is selected as the designated port. The bridge to which the designated port belongs is called the designated bridge for the LAN. Note that a bridge with ports on different LANs can have more than one designated port, as shown in Figure 15-2. The criteria used to select designated ports are described later in the section "Designated Port Selection."

根端口通向树的根部(即根桥),而指定端口则通向叶子。在图15-2中,您可以看到根端口和指定端口之间的关系。

While root ports lead toward the root of the tree (i.e., the root bridge), designated ports lead toward the leaves. In Figure 15-2, you can see the relationship between root and designated ports.

从树的角度来看,这两个角色可以这样看:

From a tree's perspective, the two roles can be seen in this way:

  • 树的根具有仅通向叶子的链接(即,仅指定端口[ * ])。

  • The tree's root has links that go only toward the leaves (i.e., only designated ports[*]).

  • 叶节点具有仅通向树根的链路(即,没有指定端口和一个根端口)。为了防止错误配置和错误布线(例如将网桥连接到应该连接主机的端口),叶节点也可以在连接主机的端口上运行 STP。在这种情况下,叶节点没有指定端口的假设将不再有效。换句话说,如果在图 15-2中连接主机的接入桥的端口上启用 STP,这些端口最终将被分配指定的角色。

  • The leaf nodes have links that go only toward the tree's root (i.e., no designated ports and one root port). As a protection against misconfigurations and wrong cabling (such as connecting a bridge to a port where you are supposed to connect a host), a leaf node can run the STP on the ports that connect the hosts, too. In this case, the assumption that a leaf node does not have designated ports would no longer be valid. In other words, if you enable the STP on the ports of the access bridges in Figure 15-2 that connect the hosts, those ports would end up being assigned the designated role.

  • 根和叶子之间的任何节点都至少有一条通往根的链路(其中一个将被选为根端口),并且至少有一条通往叶子的链路(指定端口)。

  • Any node between root and leaves has at least one link toward the root (one of which will be selected as the root port), and at least one toward the leaves (a designated port).

存在既不是根端口也不是指定端口的 STP 端口；当网桥之间有冗余链路时就可能出现这种情况。图 15-2 中通向 D2 的 A1 端口就是一个例子。较新的 STP 协议定义了新的角色，以便每个 STP 端口都能分配到一个角色；我将在“较新的生成树协议概述”一节中简要介绍它们。

There are STP ports that are neither root nor designated ports; this is possible when you have redundant links between bridges. In Figure 15-2, the A1 port that goes to D2 is an example. The newer STP protocols, which I will briefly introduce in the section "Overview of Newer Spanning Tree Protocols," define new roles so that each STP port is assigned one.

我们将分别在“根端口选择”和“指定端口选择”部分中了解如何分配根端口角色和指定端口角色。

We will see how the root and designated port roles are assigned in the sections "Root Port Selection" and "Designated Port Selection," respectively.

网桥和端口 ID

Bridge and Port IDs

根桥以及端口状态和角色的选择取决于一组参数。每个参数都分配有一个默认值,可以通过用户配置进行更改。主要参数如下:

The selection of the root bridge and the port state and roles depends on a set of parameters. Each parameter is assigned a default value that can be changed by user configuration. Here are the main parameters:

桥ID
Bridge ID

每个网桥都分配有一个 ID，称为网桥 ID，它定义为分成两个部分的 8 字节值。最低的 6 个字节是桥接端口之一的以太网 MAC 地址（请参阅第 16 章），最高的 2 个字节是可配置的优先级，称为网桥优先级。网桥 ID 是根网桥选择算法使用的字段（请参阅“根网桥选择”部分）。

Each bridge is assigned an ID, called the bridge ID, that is defined as an 8-byte value split into two components. The lowest six bytes are assigned the Ethernet MAC address of one of the bridge ports (see Chapter 16), and the highest two bytes are a configurable priority, called the bridge priority . The bridge ID is the field used by the root bridge selection algorithm (see the section "Root Bridge Selection").
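可以用几行 Python 把 802.1t 之前的网桥 ID 布局具体化（MAC 地址与函数名为示例假设）。

To make the pre-802.1t layout concrete, here is a sketch (the `bridge_id` helper and MAC values are this example's own) that packs the 2-byte priority in front of a 6-byte MAC; because lower bridge IDs win elections, the priority bytes dominate the comparison and the MAC breaks ties:

```python
import struct

def bridge_id(priority, mac):
    """Build the 8-byte pre-802.1t bridge ID: a 2-byte priority
    followed by the 6-byte MAC address of one of the bridge ports."""
    mac_bytes = bytes(int(b, 16) for b in mac.split(":"))
    return struct.pack("!H", priority) + mac_bytes

# Lower IDs win the root bridge election, so a lower configured
# priority beats any MAC address.
a = bridge_id(32768, "00:11:22:33:44:55")  # default priority
b = bridge_id(4096, "ff:ff:ff:ff:ff:ff")   # admin lowered priority
print(b < a)  # True: the priority bytes dominate the comparison
```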

端口 ID
Port ID

每个端口都分配有一个 ID。ID 的一部分代表称为端口号的唯一标识符。端口号的分配方式取决于实现，并且其值仅在网桥本地有意义。例如，该数字可以反映端口创建的顺序：第一个端口分配为 1，第二个端口分配为 2，等等。另一种方法可以使用端口的物理位置：例如，总线上的第一个端口分配为 1，等等。端口号分配最好在重新启动后具有确定性和一致性，这样系统管理员就无需在重新启动后更改网桥配置来反映变化。

ID 的另一部分称为端口优先级 ,用于为端口分配优先级(其中较低的值表示较高的优先级)。见图15-7(b)

有关使用此参数的示例,请参阅“根端口选择”部分。

Each port is assigned an ID. A portion of the ID represents a unique identifier called the port number. The way the port number is assigned is implementation dependent, and its value is meaningful only locally on the bridge. For example, the number can reflect the sequence in which ports were created: the first port is assigned 1, the second port 2, etc. Another approach could use the physical location of the port: for example, the first port on the bus is assigned 1, etc. It is desirable to have the port number assignments be deterministic and consistent across reboots so that the system administrator does not need to change the bridge configuration to reflect the changes after a reboot.

Another portion of the ID, called the port priority , is used to assign a priority to the port (where a lower value means a higher priority). See Figure 15-7(b).

See the section "Root Port Selection" for an example of when this parameter is used.

除了网桥和端口优先级之外,用户还可以配置以下参数:

Besides the bridge and port priority, the user can configure the following parameters:

端口成本
Port cost

每个端口都分配有一个成本。值越低,端口越优先。如果未明确配置,则会根据端口的速度为端口分配默认成本。例如,为运行速度为 100 Mbits/s 的快速以太网端口分配的成本低于运行速度为 10 Mbits/s 的以太网端口。在大多数情况下,当从树的一个点到另一点的总体成本是根据延迟来衡量时,默认成本分配是有意义的。然而,在特定情况下,管理员可能更愿意根据外部因素明确分配成本。

Each port is assigned a cost. The lower the value, the more preferred the port is. When not explicitly configured, the port is assigned a default cost based on the port's speed. For example, a Fast Ethernet port that runs at 100 Mbits/s is assigned a lower cost than an Ethernet port that runs at 10 Mbits/s. The default cost assignment makes sense in most cases, when the overall cost of going from one point of the tree to another is measured in terms of latency. However, it is possible that in specific contexts, the administrator prefers to explicitly assign costs based on external factors.
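下面的草图演示了“未显式配置时按速度取默认成本”的取值逻辑（表中的默认值取自 802.1D-1998 的推荐值，这里仅作示意）。

A sketch of the lookup logic, with default values taken from the 802.1D-1998 recommendations (treat the exact numbers as illustrative rather than something this chapter has established):

```python
# Default port costs per link speed, following the inverse-speed
# scheme recommended by 802.1D-1998 (illustrative values).
DEFAULT_COST = {10: 100, 100: 19, 1000: 4, 10000: 2}  # Mbit/s -> cost

def port_cost(speed_mbps, configured=None):
    """Return the explicitly configured cost if the administrator
    set one, otherwise the speed-based default (lower = preferred)."""
    if configured is not None:
        return configured
    return DEFAULT_COST[speed_mbps]

print(port_cost(100))               # 19: Fast Ethernet default
print(port_cost(10))                # 100: plain Ethernet, less preferred
print(port_cost(10, configured=5))  # 5: administrative override wins
```

This matches the text: defaults track latency, while an explicit configuration lets the administrator encode external factors such as monetary cost or reliability.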

定时器
Timers

STP使用一组每端口和每桥定时器。它们都有默认配置,可由用户自定义。请参阅“定时器”部分。定时器的配置不影响根桥的选择以及端口的状态和角色。

The STP uses a set of per-port and per-bridge timers . All of them have a default configuration that can be customized by the user. See the section "Timers." The timer configuration does not affect the selection of the root bridge and the port state and roles.

我们将在本章后面看到如何使用这些参数的配置(定时器除外)来影响拓扑的选择。

We will see later in this chapter how the configuration of these parameters (with the exception of the timers) can be used to influence the selection of the topology.

2001 年，IEEE 发布了 802.1t（802.1D 的维护文档），它改变了网桥 ID 和端口 ID 的定义方式。格式的变化如图 15-7 所示。

In 2001, the IEEE released the 802.1t, 802.1D's maintenance document, which changed how bridge and port IDs are defined. The changes in format are shown in Figure 15-7.

802.1t 引入的网桥 ID 和端口 ID 更改

图 15-7。802.1t 引入的网桥 ID 和端口 ID 更改

Figure 15-7. Bridge ID and port ID changes introduced by 802.1t

注意:

Note that:

  • 网桥优先级现在只有 4 位。为了向后兼容，网桥优先级范围仍然是 0-64K，但由于只有 4 位可用，优先级现在以 4,096（2¹²）为增量分配。

  • The bridge priority is now only 4 bits in size. For backward compatibility, the bridge priority range is still 0-64 K, but since you have only four bits to play with, you now have priorities in increments of 4,096 (2¹²).

  • 网桥 ID 中有一个新组件，称为系统 ID 扩展。该组件可以取 4,096 个不同的值，例如允许一台网络设备用单个 MAC 地址拥有最多 4,096 个不同的网桥 ID。以前，这需要 4,096 个不同的 MAC 地址。请注意，MAC 地址不是管理员选择的随机数；它们是由 IEEE 管理的全球唯一编号（因此是有限资源）。

  • There is a new component in the bridge ID, called the system ID extension. This component, which can assume 4,096 different values, allows a network device, for example, to have up to 4,096 different bridge IDs sharing a single MAC address. Before, this would have required 4,096 different MAC addresses. Note that MAC addresses are not random numbers chosen by the administrator; they are worldwide unique numbers (and therefore are a limited resource) that are managed by the IEEE.

  • 端口号现在是 12 位值，这使得网桥最多可以有 4,096 个端口。以前最多只能有 256 个（这在当初已被认为相当奢侈）。端口优先级现在是一个 4 位值。为了向后兼容，优先级范围仍然是 1-256，因此优先级现在以 16 为增量进行分配。

  • The port number is now a 12-bit value, which allows a bridge to have up to 4,096 ports. Before you could have had only 256 (which was originally considered quite luxurious). The port priority is now a 4-bit value. The priority range is still 1-256 for backward compatibility, so priorities are now assigned in increments of 16.
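802.1t 的网桥 ID 布局可以这样示意（函数名与校验逻辑为本例假设）。

The 802.1t bridge ID layout can be sketched as follows (the helper name and validation are this example's own); note how 4,096 bridge IDs can now share one MAC address via the system ID extension:

```python
def bridge_id_8021t(priority, system_id_ext, mac):
    """Build the 8-byte 802.1t bridge ID: a 4-bit priority (kept as
    a multiple of 4,096 for backward compatibility), a 12-bit system
    ID extension, then the 6-byte MAC address."""
    if priority % 4096 or not 0 <= priority <= 61440:
        raise ValueError("priority must be a multiple of 4,096 in 0-61,440")
    if not 0 <= system_id_ext < 4096:
        raise ValueError("system ID extension is a 12-bit value")
    first_two = priority | system_id_ext  # priority fills the top 4 bits
    mac_bytes = bytes(int(b, 16) for b in mac.split(":"))
    return first_two.to_bytes(2, "big") + mac_bytes

# 4,096 bridge IDs can share one MAC address:
id_a = bridge_id_8021t(32768, 0, "00:11:22:33:44:55")
id_b = bridge_id_8021t(32768, 1, "00:11:22:33:44:55")
print(id_a != id_b)  # True
```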

要了解 802.1t 变化的原因,您需要考虑高端商用设备,而不是仅配备几个网卡的普通 PC。后者可以在 256 个桥接端口的限制下生存,或者每个 MAC 地址只有一个桥接 ID。然而,大型网桥配备数百个端口并运行数百个网桥实例的情况并不少见。

To understand the reasons for the 802.1t changes, you need to think in terms of high-end commercial devices, not common PCs equipped with just a few NICs. The latter can survive with a limit of 256 bridge ports, or a single bridge ID per MAC address. However, it is not uncommon for big bridges to be equipped with hundreds of ports and to run hundreds of instances of bridges.

另请注意,4,096 不是随机值:它表示 802.1Q 协议中允许的虚拟 LAN (VLAN) 的最大数量。

Note also that 4,096 is not a random value: it represents the maximum number of Virtual LANs (VLANs) allowed in the 802.1Q protocol.

802.1t 更改不会对 STP 产生任何影响。从 STP 的角度来看,网桥 ID 是一个 8 字节值,端口 ID 是一个 2 字节值。用户配置组件的大小或用途并不重要。这意味着 802.1t 更改仅影响配置工具,而不影响协议的行为。表 15-115-2总结了不同参数的可能值。

The 802.1t changes do not have any impact on the STP. From the STP's perspective, a bridge ID is an 8-byte value and a port ID is a 2-byte value. The size or purpose of the user-configuration component does not matter. This means that the 802.1t changes affect only configuration tools, not the protocol's behavior. Tables 15-1 and 15-2 summarize the possible values of the different parameters.

表 15-1。802.1t之前的网桥ID和端口ID

Table 15-1. Bridge IDs and port IDs before 802.1t

 

默认值

Default value

最小值

Min. value

最大值

Max. value

最小增量

Min. increment

桥优先级

Bridge priority

32,768

32,768

0

0

65,535

65,535

1

1

端口成本

Port cost

取决于端口速度

Depends on port speed

1

1

65,535

65,535

1

1

端口优先级

Port priority

128

128

0

0

255

255

1

1

表 15-2。802.1t后的网桥ID和端口ID

Table 15-2. Bridge IDs and port IDs after 802.1t

 

默认值

Default value

最小值

Min. value

最大值

Max. value

最小增量

Min. increment

桥优先级

Bridge priority

32,768

32,768

0

0

61,440

61,440

4,096

4,096

端口成本

Port cost

取决于端口速度

Depends on port speed

1

1

200,000,000

200,000,000

1

1

端口优先级

Port priority

128

128

0

0

240

240

16

16

桥接协议数据单元 (BPDU)

Bridge Protocol Data Units (BPDUs)

网桥交换称为 BPDU 的协议帧,其中包含足够的信息,以便它们就谁是根网桥达成一致,并决定其本地端口的角色和状态。BPDU 有两种类型:

Bridges exchange protocol frames, called BPDUs, that include enough information for them to agree on who is the root bridge, and to decide on the roles and states for their local ports. There are two kinds of BPDUs:

配置BPDU
Configuration BPDU

用于定义无环路拓扑。您将在“何时传输配置 BPDU ”部分中看到哪些条件触发这些 BPDU 的传输。

Used to define the loop-free topology. You will see in the section "When to Transmit Configuration BPDUs" what conditions trigger the transmission of these BPDUs.

拓扑更改通知 (TCN) BPDU
Topology Change Notification (TCN) BPDU

由网桥用来向根网桥通知检测到的拓扑更改。请参阅“拓扑更改”部分。

Used by a bridge to notify the root bridge about a detected topology change. See the section "Topology Changes."

图 15-8显示了两个 BPDU 的格式。请注意,这两种类型共享相同的前三个字段,可以通过 BPDU 类型参数来区分。

Figure 15-8 shows the format of both BPDUs. Note that the two types share the same first three fields and can be distinguished by the BPDU type parameter.

a) 配置 BPDU；b) TCN BPDU

图 15-8。a) 配置 BPDU；b) TCN BPDU

Figure 15-8. a) Configuration BPDU; b) TCN BPDU
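两类 BPDU 共享的前几个字段可以用如下解析草图说明（0x00/0x80 的类型编码是 802.1D 常见取值，此处作为假设标注）。

A parsing sketch for the shared leading fields: a 2-byte protocol ID, a 1-byte protocol version, and a 1-byte BPDU type. The 0x00 (configuration) and 0x80 (TCN) type encodings are the usual 802.1D values; treat them and the `bpdu_kind` helper as assumptions of this sketch:

```python
import struct

# BPDU type encodings commonly used by 802.1D (assumed here).
BPDU_TYPES = {0x00: "configuration", 0x80: "TCN"}

def bpdu_kind(frame):
    """Classify a BPDU payload by its shared 4-byte header."""
    proto_id, version, bpdu_type = struct.unpack("!HBB", frame[:4])
    if proto_id != 0:
        raise ValueError("not an IEEE spanning tree BPDU")
    return BPDU_TYPES.get(bpdu_type, "unknown")

print(bpdu_kind(b"\x00\x00\x00\x00"))  # configuration (802.1D, version 0)
print(bpdu_kind(b"\x00\x00\x00\x80"))  # TCN
```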

三种 IEEE STP 使用的协议 ID 和协议版本的组合如表 15-3 所示。在本章中，我们将仅了解基本的 802.1D 协议，并在“较新的生成树协议概述”一节中简要介绍其他两个协议。

Table 15-3 shows the combinations of protocol ID and protocol version used by the three IEEE STPs. In this chapter, we will look only at the basic 802.1D protocol and briefly introduce the other two in the section "Overview of Newer Spanning Tree Protocols."

表 15-3。BPDU 版本

Table 15-3. BPDU versions

协议名称

Protocol name

协议 ID

Protocol ID

协议版本

Protocol version

STP (802.1D-1998)

STP (802.1D-1998)

0

0

0

0

RSTP(802.1D-2002 或 802.1w)

RSTP (802.1D-2002 or 802.1w)

0

0

2

2

MSTP(802.1Q-2002 或 802.1s)

MSTP (802.1Q-2002 or 802.1s)

0

0

3

3

配置BPDU

Configuration BPDU

配置BPDU中各个字段的含义如下:

Here is the meaning of the fields in the configuration BPDU:

标志
Flags

仅使用两个标志:TC(拓扑更改)和TCA(拓扑更改确认)。“拓扑更改”部分描述了两者的使用。

Only two flags are used: TC (Topology Change) and TCA (Topology Change Acknowledgment). The use of both is described in the section "Topology Changes."

根桥ID
Root Bridge ID

根桥的ID。这就是发送桥认为当前根桥的情况。

ID of the root bridge. This is what the transmitting bridge thinks the current root bridge is.

根路径成本
Root Path Cost

从发送桥到根桥的最短路径的成本。当发送桥是(或认为它将成为)根桥时,成本为 0。

Cost of the shortest path from the transmitting bridge to the root bridge. The cost is 0 when the transmitting bridge is (or thinks it is to become) the root bridge.

桥ID
Bridge ID

发送桥的ID。

ID of the transmitting bridge.

端口 ID
Port ID

端口标识符。有关其语法,请参阅“网桥和端口 ID ”部分。

Port identifier. See the section "Bridge and Port IDs" for its syntax.

消息年龄
Message Age

自根桥生成此 BPDU 中的信息以来已经过去了多少时间。请参阅“ BPDU 老化”部分。

How much time has passed since the root bridge generated the information in this BPDU. See the section "BPDU Aging."

最大年龄
Max Age

配置 BPDU 的最长生命周期。

Maximum lifetime for configuration BPDUs.

Hello 时间
Hello Time

Hello 计时器使用的超时时间。

Timeout used by the Hello timer.

转发延迟
Forward Delay

转发延迟计时器使用的超时。见图15-6

Timeout used by the Forward Delay timer. See Figure 15-6.

Max Age、Hello Time 和 Forward Delay 这三个定时器的值不是网桥本地配置的值：它们是根网桥通告的值（请参阅“发送配置 BPDU”一节）。所有这些值都以刻度（1/256 秒）表示。请参阅“定时器”部分。

The values of the three timers Max Age, Hello Time, and Forward Delay are not the ones configured locally on the bridge: they are the ones advertised by the root bridge (see the section "Transmitting Configuration BPDUs"). All of them are expressed in ticks (1/256th of second). See the section "Timers."

优先向量

Priority Vector

配置 BPDU 的四个组成部分(根网桥 ID、根路径成本、网桥 ID 和端口 ID)组成了优先级向量(见图15-8)。由于这四个分量是按顺序排列的,因此该向量可以视为单个 22 字节数字。数字越小,表示网桥在拓扑中的重要性越高;换句话说,优先级向量决定谁赢得竞争角色(例如根桥和指定桥)的竞标。在本章的其余部分中,我将使用[BR-Root, Cost, BR-ID, Port-ID]表示法来引用优先级向量。

Four components of the configuration BPDU—Root Bridge ID, Root Path Cost, Bridge ID, and Port ID—make up the priority vector (see Figure 15-8). Because these four components are in sequence, this vector can be seen as a single 22-byte number. The lower the number is, the more important the bridge is in the topology; in other words, the priority vector determines who wins the bidding for contested roles such as root bridge and designated bridge. In the rest of this chapter, I will refer to priority vectors using a [BR-Root, Cost, BR-ID, Port-ID] notation.

在本章后面的示例中,图中仅显示了所传输的配置 BPDU 的优先级部分,因为这是网桥用来选择其端口角色和状态的配置 BPDU 的一部分。

In the examples later in this chapter, the figures show only the priority component of the configuration BPDUs transmitted, because that is the portion of the configuration BPDU used by the bridges to select their port's roles and states.

给定两个优先级向量 PV1=[BR-Root1, Cost1, BR-ID1, Port-ID1] 和 PV2=[BR-Root2, Cost2, BR-ID2, Port-ID2]，当 PV1 的数值低于 PV2 时，称 PV1 优于 PV2；当 PV1 的数值高于 PV2 时，称 PV1 劣于 PV2。换句话说，如果 BR-Root1 < BR-Root2，则 PV1 优于 PV2；如果两者相同，则比较 Cost1 < Cost2；如果成本也相同，则比较 BR-ID1 < BR-ID2；当两个网桥 ID 也匹配时，则比较 Port-ID1 < Port-ID2。

Given two priority vectors PV1=[BR-Root1, Cost1, BR-ID-1, Port-ID1] and PV2=[BR-Root2, Cost2, BR-ID-2, Port-ID2], PV1 is said to be superior when it is a lower numeric value than PV2, and inferior when PV1 is a higher numeric value than PV2. In other words, PV1 is superior to PV2 if BR-Root1< BR-Root2, or, in case they are the same, if Cost1< Cost2, or, if they are the same too, if BR-ID1< BR-ID2, or, when the two bridge IDs match too, when Port-ID1<Port-ID2.
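优先级向量的比较规则可以直接映射为元组比较（示例中的数值 ID 为假设）。

The comparison rule maps directly onto lexicographic tuple comparison; this sketch (with made-up numeric IDs) shows the component-by-component tie-breaking described above:

```python
def superior(pv1, pv2):
    """True if priority vector pv1 beats pv2.

    A vector is (root bridge ID, root path cost, bridge ID, port ID).
    Comparing components in order is the same as comparing the
    concatenated 22-byte number, and Python tuples compare that way.
    """
    return tuple(pv1) < tuple(pv2)

# Same claimed root and cost: the lower transmitting bridge ID wins.
print(superior((1, 10, 7, 1), (1, 10, 9, 1)))  # True
# A lower root bridge ID dominates every other component.
print(superior((1, 99, 9, 2), (2, 0, 1, 1)))   # True
```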

何时发送配置 BPDU

When to Transmit Configuration BPDUs

网桥从其指定端口发送配置 BPDU。它在以下情况下这样做:

A bridge transmits configuration BPDUs out of its designated ports. It does so in the following cases:

  • 根桥上运行一个定时器（Hello 定时器），该定时器定期到期并触发配置 BPDU 的发送。一个 BPDU 在其每个指定端口上传输。只有根桥生成新的 BPDU，但是当桥第一次启用时，它认为自己是根桥（因为它没有其他优先级向量可以与自己的比较）。因此，它将所有端口置于指定角色，启动其 Hello 计时器，并开始生成 BPDU（请参阅“根桥选择”部分）。

  • The root bridge runs a timer (the Hello timer) that expires regularly and triggers the transmission of configuration BPDUs. One BPDU is transmitted on each one of its designated ports. Only the root bridge generates new BPDUs, but when a bridge is first enabled, it thinks it is the root bridge (because it has no other priority vector to compare its own to). So it places all of its ports into the designated role, starts its Hello timer, and begins to generate BPDUs (see the section "Root Bridge Selection").

  • 非根网桥仅生成 BPDU 以响应在其根端口上收到的 BPDU;换句话说,它们中继 BPDU。非根桥发送的 BPDU 携带的信息与其收到的 BPDU 相同,但更新的以下字段除外(参见图 15-9):

    • 网桥用自己的信息替换发送方的网桥 ID 和端口 ID。

    • 网桥将成本更新为它收到的成本和它接收到 BPDU 的本地网桥端口(其根端口)的成本之和。

    • 消息寿命根据“ BPDU 老化”部分中描述的逻辑进行更新。后一节解释了如何定义 DT 数量。

  • Nonroot bridges generate BPDUs only in response to ones they receive on their root ports; in other words, they relay BPDUs. BPDUs transmitted by nonroot bridges carry the same information as the BPDUs they received, with the exception of the following fields that they update (see Figure 15-9):

    • The transmitter's bridge ID and port ID are replaced by the bridge with its own information.

    • The bridge updates the cost to be the sum of the cost it received and the cost of the port on the local bridge (its root port) that it received the BPDU on.

    • The message age is updated according to the logic described in the section "BPDU Aging." The latter section explains how the DT quantity is defined.
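The three field updates performed by a relaying nonroot bridge can be sketched as follows (the dataclass and names are illustrative; the 256-tick age increment reflects the common hop-count-style approach discussed later in "BPDU Aging", not Linux's behavior):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ConfigBPDU:
    root_bridge_id: int
    root_path_cost: int
    bridge_id: int     # transmitter's bridge ID
    port_id: int       # transmitter's port ID
    message_age: int   # in ticks (1/256th of a second)

def relay(bpdu, my_bridge_id, egress_port_id, root_port_cost, age_increment=256):
    """Regenerate a received BPDU for transmission on a designated port."""
    return replace(
        bpdu,
        bridge_id=my_bridge_id,                               # replace transmitter's bridge ID
        port_id=egress_port_id,                               # replace transmitter's port ID
        root_path_cost=bpdu.root_path_cost + root_port_cost,  # add the root port's cost
        message_age=bpdu.message_age + age_increment,         # age the message
    )

# A BPDU received from the root bridge is relayed with an updated cost and age.
rx = ConfigBPDU(root_bridge_id=1, root_path_cost=0, bridge_id=1, port_id=2, message_age=0)
tx = relay(rx, my_bridge_id=2, egress_port_id=2, root_port_cost=10)
print(tx.root_path_cost, tx.message_age)  # 10 256
```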

无论桥是否是根桥,在以下情况下都会发送配置BPDU:

Regardless of whether a bridge is the root bridge, it transmits a configuration BPDU in the following cases as well:

无论出于何种原因从给定端口传输配置 BPDU，STP 都会强制执行速率限制：网桥每秒不能从其任何端口传输超过一个配置 BPDU（请参阅“传输配置 BPDU”一节）。

Regardless of why a configuration BPDU is transmitted out of a given port, the STP enforces rate limiting: a bridge cannot transmit more than one Configuration BPDU per second out of any of its ports (see the section "Transmitting Configuration BPDUs").

BPDU老化

BPDU Aging

由于 BPDU 仅由根桥生成，而其他网桥只有在其根端口上收到 BPDU 后才会重新生成 BPDU，因此显而易见，根桥通过其 BPDU 生成的信息到达叶网桥所需的时间是可变的。在稳定的网络上，该时间主要取决于网桥的负载情况以及它们处理 BPDU 的速度。

Because BPDUs are generated only by the root bridge, and are regenerated by the other bridges only upon the reception of a BPDU on their root port, it should be clear that the time taken by the information generated by the root bridge with its BPDUs to reach the leaf bridges is variable. On a stable network, the time depends mainly on how loaded the bridges are and how fast they can process BPDUs.

图 15-9。通过非根桥进行 BPDU 中继

Figure 15-9. BPDU relaying via nonroot bridges

携带过时信息的 BPDU 不应用于构建无环拓扑。因此,配置 BPDU 具有一个名为“消息期限”的字段,接收网桥将其与另一个字段“最大期限”进行比较,以丢弃那些存在时间过长且优先级向量不可信的 BPDU。

BPDUs carrying stale information should not be used to build the loop-free topology. For that reason, configuration BPDUs have a field called Message Age that is compared by the receiving bridge against the other field, Max Age, to discard those BPDUs that have been around for too long and whose priority vector cannot be trusted.

消息年龄字段首先由根桥初始化为 0，并在转发之前由每个非根桥更新（见图 15-9 中的 DT）。消息年龄应该表示自根桥生成原始 BPDU 以来经过的时间。然而，计算这个时间并不容易。例如，它应该同时考虑传输延迟和处理时间：换句话说，帧在介质中从一个网桥的端口传到下一个端口所花费的时间，以及每个网桥处理并重新生成该帧时它在网桥内存中停留的时间。但商业网桥中的常见做法是简单地将消息年龄字段视为跳数，就像 IP 报头的生存时间 (TTL) 字段一样：入口 BPDU 的消息年龄字段递增 256 个刻度（即 1 秒）并复制到出口 BPDU 中。这意味着 BPDU 最多经过 20 跳后就会被丢弃（最大年龄的默认值是 20 秒）。Linux 不将消息年龄用作跳数，而是尝试遵循“传输配置 BPDU”一节中描述的原始规则。

The Message Age field is first initialized to 0 by the root bridge, and is updated by each nonroot bridge prior to forwarding it (see DT in Figure 15-9). The Message Age is supposed to represent the time that has passed since the original root bridge's BPDU was generated. However, to calculate this time is not easy. It should, for example, account for both the transmission delays and the processing time: in other words, the time spent by the frame in the media going from one bridge's port to the next one, and the time spent in the bridge's memory while each bridge processes and regenerates it. But a common approach in commercial bridges is to simply treat the Message Age field as a hop count, just like the Time To Live (TTL) field of the IP header: the Message Age field of the ingress BPDU is incremented by 256 ticks (i.e., 1 second) and copied into the outgoing BPDUs. This means that a BPDU would be dropped after a maximum of 20 hops (20 seconds is the default value for Max Age). Linux does not use the message age as a hop count, but tries to respect the original rule described in the section "Transmitting Configuration BPDUs."
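The hop-count interpretation used by many commercial bridges can be sketched like this (a simplification, and explicitly not Linux's behavior; names are mine, values in ticks):

```python
TICKS_PER_SECOND = 256
MAX_AGE = 20 * TICKS_PER_SECOND  # default Max Age: 20 seconds, in ticks

def should_discard(message_age_ticks, max_age_ticks=MAX_AGE):
    """Drop a BPDU whose information has been around for Max Age or longer."""
    return message_age_ticks >= max_age_ticks

# Treating each hop as a fixed 256-tick (1-second) increment, a BPDU
# survives at most 20 hops with the default Max Age.
age = 0
hops = 0
while not should_discard(age):
    age += TICKS_PER_SECOND  # one relay hop = +256 ticks
    hops += 1
print(hops)  # 20
```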

当网桥在其某个端口上收到的 BPDU 尚未过期（即消息年龄小于最大年龄）时，网桥会启动一个消息年龄计时器，该计时器将在最大年龄与消息年龄之差所指定的时间后到期。请参阅“定时器”一节，了解消息年龄计时器到期时触发的操作。这确保了 BPDU 携带的信息在其生成后最大年龄秒内若未得到确认，就会被丢弃。

When the BPDU received by a bridge on one of its port has not expired (i.e., the Message Age is less than the Max Age), the bridge starts a Message Age timer that will expire after an amount of time given by the difference between the Max Age and the Message Age. Refer to the section "Timers" for the actions triggered by the expiration of the Message Age timer. This ensures that the information carried by the BPDU is discarded Max Age seconds after its generation, unless it is confirmed by then.
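The remaining lifetime that this timer counts down is just the difference between the two fields. A one-line sketch (function name is illustrative; values in seconds for readability):

```python
def message_age_timer_duration(max_age, message_age):
    """Remaining lifetime of the BPDU's information, in seconds."""
    return max_age - message_age

# A BPDU received with Message Age 3 on a network using the default
# Max Age of 20 leaves 17 seconds before its information expires.
print(message_age_timer_duration(20, 3))  # 17
```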

定义活动拓扑

Defining the Active Topology

每个网桥在本地配置和通过入口配置 BPDU 接收到的信息的帮助下,能够完成以下任务:

Each bridge, with the help of the local configuration and the information received with the ingress configuration BPDUs, is able to accomplish the following:

  • 选举根桥

  • Elect the root bridge

  • 选择其端口之一作为根端口

  • Select one of its ports as the root port

  • 对于每个端口,标识该端口所属 LAN 的指定网桥和指定端口

  • For each port, identify the designated bridge and designated port for the LAN to which the port belongs

每次网络中发生可能需要更改拓扑的变化时,都需要这些任务(我将其称为配置更新) 。例如:

Those tasks, which I will refer to as a configuration update, are needed every time something changes in the network that may require a change in the topology. For instance:

  • 端口被启用或禁用。

  • A port is either enabled or disabled.

  • 端口的消息期限计时器到期。在这种情况下,端口将重新启动(即分配指定的角色)。

  • A port's Message Age timer expires. In this case, the port is restarted (i.e., assigned the designated role).

  • 网桥的本地配置发生变化。

  • The local configuration of a bridge changes.

  • 网桥端口收到的配置 BPDU 与先前在同一端口上收到的相比，具有更优的优先级向量。

  • A bridge port receives a configuration BPDU with a superior priority vector compared to the one previously received on the same port.

请注意,配置更新是在配置更改或端口更改管理状态的网桥上触发的。其他网桥将在看到它们收到的 BPDU 携带的信息中反映的这些变化后跟随(如有必要)。

Note that a configuration update is triggered on the bridge where the configuration is changed or where a port changes administrative state. The other bridges will follow (if necessary) upon seeing these changes reflected in the information carried by the BPDUs they receive.

让我们一一看看配置更新的任务是如何处理的。

Let's see how the configuration update's tasks are taken care of, one by one.

根桥选择

Root Bridge Selection

我们在“网桥和端口 ID ”部分看到了如何定义网桥 ID。鉴于 MAC 地址在全球范围内是唯一的,仅基于使用 MAC 地址来选择根桥的算法就足以确保确定性选择。然而,添加优先级组件允许管理员通过为他们希望被选为根的网桥分配更高的优先级来强制采用他们喜欢的拓扑。他们甚至可以将战略优先级分配给不同的网桥,以便在当前根网桥出现故障时也可以强制指定网桥接管。

We saw in the section "Bridge and Port IDs" how a bridge ID is defined. An algorithm based only on the use of the MAC address for the selection of the root bridge would be sufficient to ensure a deterministic selection, given that MAC addresses are unique worldwide. However, the addition of the priority component allows administrators to force the topology they like by assigning higher priorities to those bridges they would like to be selected as root. They can even assign strategic priorities to different bridges so that they can also force a given bridge to take over in case the current root bridge fails.

当网桥首次启用时，它对拓扑一无所知，因此认为自己就是根网桥。于是它将指定角色分配给自己的所有端口，在这些端口上启动转发延迟定时器，使它们最终被分配转发状态（见图 15-6），并开始传输 BPDU，在根网桥 ID 字段中使用自己的网桥 ID，根路径成本为 0。这是一种方便的做法，可以广播关于自身的数据并尽快传播出去，以便它和其他网桥都能发现真正最佳的根网桥并重新平衡树。

When a bridge is first enabled, it does not know anything about the topology and therefore thinks it's the root bridge. It will therefore assign the designated role to all its ports, start the Forward Delay timer on the ports so that they eventually will be assigned the forwarding state (see Figure 15-6), and start transmitting BPDUs using the bridge's ID as the root Bridge ID field, and a root path cost of 0. This is a convenient way to make it broadcast data about itself and get that data spread around as quickly as possible so that both it and other bridges can discover the truly best root bridge and rebalance the tree.

当该网桥是具有最佳网桥 ID 的网桥时,它将继续在其指定端口上发送 BPDU,因为没有其他网桥可以声明更好的优先级向量(更准确地说,更好的网桥 ID),从而接管根角色。

When the bridge is the one with the best bridge ID, it will keep sending out BPDUs on its designated ports because no other bridge can claim a better priority vector (to be more exact, a better bridge ID) and therefore take over the root role.

如果网桥没有最好的网桥 ID,它最终将收到具有更好根网桥 ID 的配置 BPDU(即高级 BPDU),并且:

If the bridge did not have the best bridge ID, it will eventually receive a configuration BPDU with a better root bridge ID (i.e., a superior BPDU) and:

  • 接受并记录更好的信息(包括根桥ID和定时器)。

  • Accept and record the better information (including the root bridge ID and timers).

  • 相应地更新其端口的状态和角色。这就是所谓的配置更新。

  • Update the state and role of its ports accordingly. This is what is called a configuration update.

根端口选择

Root Port Selection

每个网桥必须选择自己的根端口，正如我们在“端口角色”一节中预先提到的，它是到根网桥路径最短（或成本最低）的端口。根网桥是唯一没有根端口的网桥；非根网桥有且只有一个根端口。

Each bridge must select its own root port , which, as we anticipated in the section "Port roles," is the port with the shortest path (or lowest cost) to the root bridge. The root bridge is the only one that does not have a root port; nonroot bridges have one and only one root port.

对于每个端口,除了管理上禁用的端口外,网桥都会保留随入口 BPDU 接收到的最佳优先级向量的副本。这样,网桥就知道对于每个端口,到达根网桥的最佳(最低成本)路径是什么。

For each of its ports, with the exception of the ones that are administratively disabled, a bridge keeps a copy of the best priority vector received with ingress BPDUs. This way, the bridge knows, for each port, what is the best (lowest cost) path to reach the root bridge.

根端口的选择只需遍历所有端口并选择具有最佳优先级向量的端口即可。如果多个端口恰好共享相同的最佳优先级向量,则选择分配端口 ID 最低的本地端口,如图15-10所示 (请注意,接收方端口 ID 不是 BPDU 的一部分)。

The selection of the root port consists simply of going through all the ports and selecting the one with the best priority vector. If more than one port happens to share the same best priority vector, the local port with the lowest assigned port ID is selected, as shown in Figure 15-10 (note that the receiver port ID is not part of the BPDU).
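The selection can be sketched as follows (names are illustrative; each candidate pairs a local port ID with the best priority vector received on that port, and the local port ID serves as the final tiebreaker):

```python
def select_root_port(candidates):
    """Pick the root port from {local_port_id: best_received_priority_vector}.

    Priority vectors are (root_bridge_id, root_path_cost, bridge_id, port_id)
    tuples; the best (lowest) vector wins, and the lowest local port ID breaks
    ties, since the receiver's port ID is not part of the BPDU itself.
    """
    return min(candidates, key=lambda port: (candidates[port], port))

# Ports 1 and 2 received identical best vectors; the lower local port ID wins.
pv = (1, 10, 4, 2)
print(select_root_port({2: pv, 1: pv, 3: (1, 20, 5, 1)}))  # 1
```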

图 15-10。根端口选择的多个候选者

Figure 15-10. Multiple candidates for the root port selection

指定端口选择

Designated Port Selection

虽然每个网桥可以有一个根端口,但每个 LAN 只能有一个指定端口。STP 确保每个网桥选择相同的端口。指定端口应该是到根桥的路径成本最低的端口。因此,它是具有最佳优先级向量的端口。

While there can be one root port per bridge, there is only one designated port per LAN. The STP ensures that each bridge chooses the same port. The designated port should be the one that has the lowest path cost to the root bridge. Thus, it's the port with the best priority vector.

每个网桥通常位于多个 LAN 上,因此它必须了解每个 LAN 的指定端口。

Each bridge is usually on more than one LAN, so it must learn the designated port for each LAN.

在两个网桥之间的点对点连接上，只有两个端口，其中传输具有最佳优先级向量的 BPDU 的那个端口被选中。相比之下，诸如以太网集线器之类的共享介质上可能连接两个以上的网桥。在这种情况下，每个网桥都会收到彼此的 BPDU，并通过检查优先级向量选出正确的指定端口。

On a point-to-point connection between two bridges, there are just two ports. The one that transmits BPDUs with the best priority vector is selected. By contrast, a shared medium such as an Ethernet hub may have more than two bridges. In that case, each bridge will receive each other's BPDUs and, by checking the priority vector, elect the right designated port.

图 15-11 显示了使用共享介质连接网桥时会发生什么。最初只有 BR2 连接到集线器，因此它选举自己为根网桥。当管理员稍后添加 BR1 时，它也认为自己是根网桥，从它在图 15-11(b) 中发送的 BPDU 可以看出这一点。假设 BR2 的 ID 高于（即劣于）BR1 的 ID，因此两个网桥最终一致同意 BR1 为根网桥。

Figure 15-11 shows what would happen when you use a shared medium to connect bridges. Initially only BR2 is connected to the hub and therefore it elects itself as the root bridge. When an administrator later adds BR1, it also thinks it is the root bridge, as you can see from the BPDUs it transmits in Figure 15-11(b). Let's assume that BR2's ID is higher than (i.e., inferior to) BR1's ID, and therefore that the two bridges end up agreeing on BR1 as the root bridge.

由于所有这些网桥端口都连接到同一集线器,因此当 BR1 的端口 1 传输 BPDU 时,BR1 在端口 2 上接收自己的 BPDU,反之亦然。然而,基于最佳优先级向量选择指定端口也适用于这种情况:优先级向量的第四个字段(即端口 ID)使端口 1 的 BPDU 优先级向量成为最佳。

Because all of these bridge ports connect to the same hub, when BR1's port 1 transmits a BPDU, BR1 receives its own BPDU on port 2, and vice versa. However, the selection of the designated port based on the best priority vector works in this scenario, too: the fourth field of the priority vector, which is the port ID, makes port 1's BPDU priority vector the best.

然而,由于多种原因,共享介质设置并不受欢迎,因此在本章的其余部分中,我将仅提及点对点的情况。

However, the shared-medium setup is unpopular for several reasons, so in the rest of this chapter, I will refer only to the point-to-point case.

图 15-11。指定端口选择

Figure 15-11. Designated port selection

STP 实际应用示例

Examples of STP in Action

假设我们有图 15-12中的拓扑。请注意,由于没有冗余链路,因此不需要 STP。我们假设:

Let's suppose we had the topology in Figure 15-12. Note that since there are no redundant links, there would be no need for the STP. Let's assume:

  • 网桥 ID BR1 < 网桥 ID BR2 < 网桥 ID BR4(因此 BR1 是根网桥)。

  • Bridge ID BR1 < Bridge ID BR2 < Bridge ID BR4 (so BR1 is the root bridge).

  • 每个网桥都可以独立于其他网桥配置其本地接口的成本。为了简单起见并使该图更易于阅读,我们假设所有路径成本都是对称的(每个链路两侧相同)。[ * ]

  • Each bridge can configure the cost of its local interfaces independently from the other bridges. For simplicity and to make the figure easier to read, let's just assume that all the path costs are symmetric (the same on both sides of each link).[*]

图 15-12。更新根路径成本

Figure 15-12. Updating the root path cost

注意:

Note that:

  • 每次Hello定时器超时,BR1的指定端口都会定时发送配置BPDU。由于 BR2 会在其端口 1 的每个 Hello 时间定期接收配置 BPDU,因此它也会在每个 Hello 时间或多或少地定期在其端口 2 上重新生成(转发)配置 BPDU。

  • The designated port of BR1 regularly transmits a configuration BPDU every time the Hello timer expires. Because BR2 receives configuration BPDUs regularly at every Hello time on its port 1, it regenerates (forwards) a configuration BPDU on its port 2 more or less regularly at every Hello time as well.

  • 网桥1的配置BPDU通告:

    • BR1作为根桥

    • 根路径成本为 0

    • 自己的网桥ID BR1

    • 端口ID为1

  • Bridge 1's configuration BPDU advertises:

    • BR1 as the root bridge

    • A root path cost of 0

    • Its own bridge ID BR1

    • A port ID of 1

  • 网桥2的配置BPDU通告:

    • BR1作为根桥

    • 根路径成本为 10(它将自己的成本添加到 BR1 发出的成本中)

    • 自己的网桥ID BR2

    • 端口ID为2

  • Bridge 2's configuration BPDU advertises:

    • BR1 as the root bridge

    • A root path cost of 10 (it adds its own cost to the one sent out by BR1)

    • Its own bridge ID BR2

    • A port ID of 2

现在我们添加一个名为 BR3 的新网桥,并假设网桥 ID BR3 < 网桥 ID BR4,如图15-13所示。

Now let's add a new bridge named BR3, and assume that Bridge ID BR3 < Bridge ID BR4, as in Figure 15-13.

正如我们在“根桥选择”部分中所解释的,当 BR4 首次启用时,它认为自己是根桥,因此它将指定的角色分配给其两个端口。它在每个端口上发出配置 BPDU,将自己通告为根桥,因此在 BPDU 中使用 0 的根路径成本。

As we explained in the section "Root Bridge Selection," when BR4 is first enabled it thinks it is the root bridge, and therefore it assigns the designated role to its two ports. It sends out a configuration BPDU on each port, advertising itself as the root bridge, and therefore using a root path cost of 0 in the BPDUs.

图 15-13。将网桥添加到稳定拓扑

Figure 15-13. Adding a bridge to a stable topology

如果我们假设 BR3 通过点对点链路连接到 BR1 和 BR4,如图 15-13所示,当 BR3 上电时,BR1 和 BR4 将启用其与 BR3 连接的端口,为这些端口分配指定角色,并开始发送配置BPDU。

If we assume BR3 to be connected to BR1 and BR4 with a point-to-point link, as in Figure 15-13, when BR3 is powered up, BR1 and BR4 will enable their ports connected to BR3, assign these ports the designated role, and start transmitting configuration BPDUs.

我们看看BR1、BR3、BR4收到对方的配置BPDU后有何反应:

Let's see how BR1, BR3, and BR4 react upon receiving each other's configuration BPDUs:

  • 来自 BR1 和 BR4 的配置 BPDU 将分别具有以下优先级向量：[BR1, 0, BR1, 2] 和 [BR1, 110, BR4, 2]。

  • The configuration BPDUs from BR1 and BR4 will have the following priority vectors, respectively: [BR1, 0, BR1, 2] and [BR1, 110, BR4, 2].

  • 由于 BR1 从 BR3 接收到的 BPDU 具有较差（劣等）的优先级向量，因此 BR1 将其端口 2 保持在指定角色并维持其根网桥角色。另一方面，当 BR3 收到来自 BR1 的 BPDU 时，它意识到 BR1 具有更好的网桥 ID（因此也有更好的优先级向量），于是更新其端口 1 的优先级向量，选择端口 1 作为其根端口，并选择 BR1 作为根网桥。

  • Because the BPDU that BR1 receives from BR3 has an inferior priority vector, BR1 keeps its port 2 in the DESIGNATED role and maintains its root bridge role. On the other hand, when BR3 receives the BPDU from BR1, it realizes that BR1 has a better bridge ID (and thus a better priority vector) and therefore updates its port 1's priority vector, selects port 1 as its root port, and selects BR1 as the root bridge.

  • BR3 从 BR4 接收到的 BPDU 的优先级向量比 BR3 发送给 BR4 的更好，但不如 BR3 从 BR1 接收到的那个。因此，BR3 不会更改其当前的根端口和根网桥信息：端口 1 仍然是根端口，BR1 仍然是根网桥。

  • The BPDU that BR3 receives from BR4 has a better priority vector than the one BR3 sent to BR4, but not as good as the one BR3 received from BR1. Because of that, BR3 does not change its current root port and root bridge information: port 1 is still the root port and BR1 is still the root bridge.

  • 当 BR3 向 BR4 发送新的 BPDU 时，如图 15-13(b) 所示，它使用反映从 BR1 获得的新信息的新优先级向量。收到该 BPDU 后，BR4 识别出这个更优的优先级向量，并阻塞其端口 2。请注意，BR3 的优先级向量之所以胜过 BR4 的，是因为其路径成本较低（即，BR3 的端口 2 被选为 LAN 指定端口，因为它比 BR4 的端口 2 更靠近根网桥）。

  • When BR3 transmits a new BPDU to BR4, as in Figure 15-13(b), it uses a new priority vector that reflects the new information acquired from BR1. Upon receiving that BPDU, BR4 recognizes the superior priority vector and it blocks its port 2. Note that BR3's priority vector wins over BR4's priority vector because of its lower path cost (i.e., BR3's port 2 is selected as the LAN-designated port because it is closer to the root bridge than BR4's port 2).

  • BR4 选择端口 1 作为其根端口,因为它是接收更好优先级向量的端口(请记住,我们假设 BR2 的网桥 ID 低于 BR3 的网桥 ID)。

  • BR4 selects port 1 as its root port because it is the one that receives the better priority vector (remember that we assumed that BR2's bridge ID is lower than BR3's bridge ID).

如果将 BR4 从 BR2 收到的配置 BPDU 与从 BR3 收到的配置 BPDU 进行比较，您会发现它们具有相同的根网桥 ID (BR1) 和相同的根路径成本 (10)，但 BR2 的 BPDU 在优先级向量的第三个分量上更优，因为 BR2 的网桥 ID 小于 BR3 的。因此，BR4 选择端口 1 作为其根端口。如果管理员更偏好 BR4 到 BR3 的链路而不是到 BR2 的链路，只需在该端口上配置较低的成本即可（请参阅“网桥和端口 ID”一节）。

If you compare the configuration BPDU that BR4 receives from BR2 to the one it receives from BR3, you can see that they share the same root bridge ID (BR1) and the same root path cost (10), but that the third component of the priority vector is better in BR2's BPDU, because BR2's bridge ID is less than BR3's. BR4 therefore selects port 1 as its root port. An administrator who had a preference for BR4's link to BR3 over the one to BR2 would simply have to configure a lower cost on that port (see the section "Bridge and Port IDs").

在此示例中,优先级向量的前三个分量足以选择根端口和指定端口。现在让我们看看图 15-14中的示例 ,何时需要第四个端口 ID 作为决定因素。

In this example, the first three components of the priority vector were sufficient for the selection of the root and designated ports. Let's see now, with the example in Figure 15-14, when the fourth one, the port ID, is needed as a tiebreaker.

现在，BR4 从 BR2 收到两个 BPDU，其优先级向量的前三个字段具有相同的值。然而，第四个参数（端口 ID）使 BR4 能够选择其端口 1 作为根端口。在“根端口选择”一节中，我们还看到了当入口 BPDU 优先级向量的所有四个分量都不足以确定获胜的 BPDU 时，网桥如何使用本地端口 ID（而不是作为优先级向量一部分的远程端口 ID）作为决胜因素。

Now BR4 receives two BPDUs from BR2 with the same values in the first three fields of the priority vector. However, the fourth parameter (the port ID) allows BR4 to select its port 1 as its root port. In the section "Root Port Selection," we also saw how a bridge uses the local port ID (as opposed to the remote port ID that is part of the priority vector) as the tiebreaker when all four components of the priority vectors of ingress BPDUs are not sufficient to identify a winning BPDU.

图 15-14。端口 ID 作为决定因素

Figure 15-14. Port ID as the tiebreaker

定时器

Timers

STP使用每桥和每端口定时器。在表 15-4和表15-5中,您可以分别查看每个网桥和每个端口计时器的默认超时以及允许的值。[ * ]

The STP uses both per-bridge and per-port timers . In Tables 15-4 and 15-5, you can see the default timeouts, and what the allowed values are, for per-bridge and per-port timers, respectively.[*]

表 15-4。桥定时器

Table 15-4. Bridge timers

定时器 | 默认值（以秒为单位） | 允许范围
--- | --- | ---
你好 (Hello) | 2 | 1-10
拓扑变化 (Topology Change) | 转发延迟 + 最大年龄 | 不可配置
TCN | 你好时间 (Hello time) | 不可配置
地址老化 (Addresses Aging) | 300 或转发延迟 a | 不可配置

a 请参阅“拓扑更改”部分。

Timer | Default value (in seconds) | Allowed range
--- | --- | ---
Hello | 2 | 1-10
Topology Change | Forward Delay + Max Age | Not configurable
TCN | Hello time | Not configurable
Addresses Aging | 300 or Forward Delay a | Not configurable

a See the section "Topology Changes."

表 15-5。端口定时器

Table 15-5. Port timers

定时器 | 默认值（以秒为单位） | 允许范围
--- | --- | ---
消息年龄 (Message Age) | 20 | 6-40
转发延迟 (Forward Delay) | 15 | 4-30
保持 (Hold) | 1 | 不可配置

Timer | Default value (in seconds) | Allowed range
--- | --- | ---
Message Age | 20 | 6-40
Forward Delay | 15 | 4-30
Hold | 1 | Not configurable

请注意,并非所有计时器都是用户可配置的。另请注意,某些计时器共享相同的配置(例如 TCN 和 Hello 计时器),因此计时器的配置更改也可能会影响其他计时器。

Note that not all timers are user configurable. Also note that some timers share the same configuration (the TCN and Hello timers, for example) so that a configuration change for a timer may affect others as well.

这些是桥定时器:

These are the bridge timers:

你好
Hello

用于定期生成配置BPDU。只有根桥使用它。

Used to regularly generate configuration BPDUs. Only the root bridge uses it.

TCN
TCN

由已检测到拓扑更改且必须通知根桥的网桥使用。请参阅“拓扑更改”部分。

Used by a bridge that has detected a topology change and must notify the root bridge about it. See the section "Topology Changes."

拓扑变化
Topology change

由根网桥用来记住在其配置 BPDU 中设置特定标志。该标志用于通知其他网桥有关拓扑更改的信息。请参阅“拓扑更改”部分。

Used by the root bridge to remember to set a specific flag in its configuration BPDUs. This flag is used to notify the other bridges about a topology change. See the section "Topology Changes."

老化定时器
Aging timer

用于从转发数据库中清除过时的地址。无论是否使用 STP,网桥都会使用此计时器。请参阅“短老化定时器”部分。

Used to clean up stale addresses from the forwarding database. This timer is used by the bridge regardless of whether the STP is used. See the section "Short Aging Timer."

每个网桥保留其计时器配置的两份副本:一份由管理员在本地配置,一份从根网桥接收。

Each bridge keeps two copies of its timer configuration: the one configured locally by the administrator, and the one received from the root bridge.

根桥是唯一使用自己配置的定时器的桥;它使所有其他网桥都采用其配置。非根网桥使用它们在根端口上接收到的 BPDU 携带的计时器配置。您可以在图 15-8中看到计时器配置的位置。

The root bridge is the only one that uses its own configured timers; it makes all the other bridges adopt its configuration. Nonroot bridges use the timer configurations carried by the BPDUs they receive on their root ports. You can see where timer configuration is located in Figure 15-8.

这些是端口定时器:

These are the port timers:

消息年龄
Message Age

我们在“BPDU 老化”一节中看到，BPDU 携带的信息的生命周期是有限的。消息年龄计时器用于强制执行该生命周期。每次端口上收到 BPDU 时，该计时器都会重新启动。每当收到 BPDU 时，都会将其消息年龄与网络的最大年龄进行比较，如果太旧，则丢弃该帧。消息年龄计时器在非指定端口（即接收较优 BPDU 的端口）上运行。

在没有问题的稳定网络中,这个计时器永远不会过期。但是,当根桥无法生成BPDU,或者收到的BPDU 过期或由于某种原因被丢弃时,定时器就会超时。当定时器到期时,端口将重新启动,并因此分配指定的角色。

We saw in the section "BPDU Aging" that the information carried by a BPDU has a limited lifetime. The Message Age timer is used to enforce that lifetime. The timer is restarted each time a BPDU is received on the port. Whenever a BPDU is received, its message age is compared to the network's max age and the frame is dropped if it is too old. The Message Age timer runs on nondesignated ports (i.e., the ones that receive superior BPDUs).

In a stable network without problems, this timer will never expire. However, when the root bridge fails to generate BPDUs, or the latter are received expired or get dropped for some reason, the timer will expire. When the timer expires, the port is restarted, and therefore assigned the designated role.

转发延迟
Forward Delay

负责从聆听到学习、从学习到转发的状态转换。图 15-15显示了通常如何处理转发延迟定时器的到期以及它如何遵循图 15-6的模型。

图 15-15。处理转发延迟计时器

Takes care of the state transitions from listening to learning, and from learning to forwarding. Figure 15-15 shows how expiration of the Forward Delay timer is typically handled and how it follows the model of Figure 15-6.

Figure 15-15. Handling the Forward Delay timer
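The transitions that the Forward Delay timer drives can be sketched as a tiny state machine (states and names are illustrative, mirroring Figure 15-6 rather than any kernel code):

```python
# On each Forward Delay timer expiration, a root or designated port advances
# one step toward forwarding; the states mirror Figure 15-6.
NEXT_STATE = {
    "listening": "learning",
    "learning": "forwarding",
}

def on_forward_delay_expiry(state):
    """Advance the port state by one step; forwarding is the terminal state."""
    return NEXT_STATE.get(state, state)

state = "listening"
state = on_forward_delay_expiry(state)  # listening -> learning
state = on_forward_delay_expiry(state)  # learning  -> forwarding
print(state)  # forwarding
```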

保持
Hold

每个端口上配置 BPDU 的传输速率限制为每秒 1 个。在稳定的网络(即 STP 已收敛的网络)上,每个指定端口在每个 Hello 时间都会传输一个 BPDU。然而,当拓扑发生变化时,由于 STP 算法的分布式特性,在复杂场景中收敛到新拓扑可能需要几分钟的时间。因此,根据“何时发送配置 BPDU ”一节的规则发送的 BPDU 数量很容易变大,此时速率限制更有可能生效。

保持定时器在需要时运行于指定端口（即传输配置 BPDU 的端口）上。

The transmission of configuration BPDUs is rate limited on each port to one per second. On a stable network—that is, one where STP has converged—each designated port transmits a BPDU at every Hello time. However, when a change in the topology occurs, the convergence to the newer topology can take minutes in complex scenarios due to the distributed nature of the STP algorithm. Because of that, the number of BPDUs sent according to the rules of the section "When to Transmit Configuration BPDUs" can easily get large, and it is here that rate limiting is more likely to kick in.

The Hold timer, when needed, runs on designated ports (the ones transmitting configuration BPDUs).
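The one-BPDU-per-second limit can be sketched with a per-port Hold timer (a simplification with illustrative names; a real implementation typically defers the suppressed transmission until the Hold timer expires rather than dropping it):

```python
class DesignatedPort:
    """Rate-limits configuration BPDU transmission to one per second."""

    HOLD_TIME = 1.0  # seconds

    def __init__(self):
        self.last_tx = None  # time of the last transmitted BPDU

    def try_transmit(self, now):
        """Return True if a BPDU may be sent at time 'now' (in seconds)."""
        if self.last_tx is not None and now - self.last_tx < self.HOLD_TIME:
            return False  # Hold timer still running: suppress this BPDU
        self.last_tx = now
        return True

port = DesignatedPort()
print(port.try_transmit(0.0))  # True  (first BPDU goes out)
print(port.try_transmit(0.5))  # False (rate limited by the Hold timer)
print(port.try_transmit(1.2))  # True  (Hold time has elapsed)
```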

每端口定时器共享配置。例如,您不能在两个不同的端口上有两个不同的 Max Age 配置。

Per-port timers share configurations. For instance, you cannot have two different Max Age configurations on two different ports.

避免临时循环

Avoiding Temporary Loops

根端口和指定端口是唯一被分配转发状态的端口。然而,当端口被指定为根角色或指定角色时,它并不会立即被指定为转发状态:它首先必须经历两个中间状态,如图15-6所示。这些中间状态抑制了临时环路的风险,同时网络收敛到稳定的无环路拓扑。我们以 图15-16(a)的简单场景为例。

The root and designated ports are the only ones that are assigned the forwarding state. When a port is assigned the root or designated role, however, it is not assigned the forwarding state right away: it first has to go through two intermediate states, as shown in Figure 15-6. These intermediate states suppress the risk of temporary loops while the network converges toward a stable loop-free topology. Let's use the simple scenario of Figure 15-16(a) as an example.

该拓扑由通过两条链路连接的两个网桥组成。必须禁用两个链接之一;否则就会出现循环。

The topology consists of two bridges connected by two links. One of the two links must be disabled; otherwise, there would be a loop.

我们之前看到,当网桥的端口首次启用时,它们会被分配指定的角色和阻塞状态。我们还在“根桥选择”部分中看到,当桥首次启用时,它对其邻居桥没有任何了解,因此它认为自己是根桥。图 15-16(a)显示了两个刚刚启用的网桥,因此:

We saw earlier that when a bridge's ports are first enabled, they are assigned the designated role and blocking state. We also saw in the section "Root Bridge Selection" that when a bridge is first enabled, it does not have any knowledge about its neighbor bridges and therefore it thinks it is the root bridge. Figure 15-16(a) shows two bridges that have just been enabled, and therefore:

  • 他们都认为自己是根桥。

  • They both think they are the root bridge.

  • 两个网桥的两个端口都被分配指定角色和阻塞状态。

  • Both ports of both bridges are assigned the designated role and the blocking state.

  • 对于每个端口,它们将状态更改为侦听,启动转发延迟计时器,并传输配置 BPDU。这些 BPDU 的优先级向量反映了它们的假设:它们的网桥是根网桥。

  • For each port, they change the state to listening, start the Forward Delay timer, and transmit a configuration BPDU. The priority vectors of those BPDUs reflect their assumption that their bridges are the root bridge.

请注意,尚未有任何端口进行转发。这些端口既不能接收也不能发送数据流量。只能发送和接收 BPDU。

Note that none of the ports is forwarding yet. Data traffic can be neither received nor transmitted on those ports. Only BPDUs can be transmitted and received.

当 BR2 在其端口上收到 BR1 的 BPDU 时，它意识到 BR1 具有更优的优先级向量（确切地说，是更好的网桥 ID）。此时，BR2 开始配置更新：选择根网桥、根端口和指定端口，并更新其所有端口的状态。特别是，它选择 BR1 作为根网桥，选择端口 1 作为根端口（因为该端口收到了具有最佳优先级向量的 BPDU）。端口 2 既不是根端口也不是指定端口，因此被阻塞（即被排除在树之外）。当端口 1 被分配新角色时，其转发延迟计时器会重新启动。当端口 2 被阻塞时，其转发延迟计时器停止。

When BR2 receives BR1's BPDUs on its ports, it realizes that BR1 has a superior priority vector (a better bridge ID, to be exact). At that point, BR2 starts a configuration update: it selects the root bridge, the root port, and the designated ports, and updates the state of all its ports. In particular, it selects BR1 as the root bridge and port 1 as the root port (because it is the port where it has received the BPDU with the best priority vector). Port 2 is neither a root port nor a designated port and is therefore blocked (i.e., it is left out of the tree). When port 1 is assigned the new role, its Forward Delay timer is restarted. When port 2 is blocked, its Forward Delay timer stops.

图 15-16。转换到转发状态

Figure 15-16. Transition to forwarding state

假设这些操作发生得相当快,您可以假设所有端口上的转发延迟计时器将或多或少同时到期,从而导致图 15-16(b)中的新配置。请注意,现在:

Supposing these actions took place pretty quickly, you can assume the Forward Delay timer will expire more or less at the same time on all ports, leading to the new configuration in Figure 15-16(b). Note that now:

  • 转发延迟计时器到期的三个端口将转至学习状态(它们尚未转发数据流量)。

  • The three ports whose Forward Delay timers expired are moved to the learning state (they are not forwarding data traffic yet).

  • 转发延迟计时器在这三个端口上重新启动。

  • The Forward Delay timers are restarted on those three ports.

  • BR2不再发送配置BPDU(因为它没有指定端口)。

  • BR2 does not transmit configuration BPDUs anymore (because it does not have a designated port).

当转发延迟定时器在 15 秒后再次到期时,BR1 的端口 1 和 2 以及 BR2 的端口 1 被分配为转发状态。此时,拓扑结构稳定。在这个简单的场景中,拓扑收敛得相当快,但由于在更复杂的设置上可能需要更长的时间,因此阻塞和转发之间的中间状态可确保避免临时循环。

When the Forward Delay timer expires again after 15 seconds, BR1's ports 1 and 2 and BR2's port 1 are assigned the forwarding state. At this point, the topology is stable. In this simple scenario, the topology converged pretty quickly, but since it may take significantly longer on more complex setups, the intermediate states between blocking and forwarding ensure that temporary loops are avoided.

注意图15-16(c)中BR1的端口2是转发的。只要链路的一侧被阻塞(图15-16(c)中BR2的端口2 ),就不会有形成环路的危险。即使 BR2 的端口 2 被禁用,BR1 的端口 2 仍在转发流量:BR1 的端口 2 和 BR2 的端口 2 可能与其他主机一起连接到集线器,并且需要 BR2 的端口 2 来提供与其他主机的连接。

Note that BR1's port 2 is forwarding in Figure 15-16(c). There is no danger of causing a loop as long as one side of the link is blocked (BR2's port 2 in Figure 15-16(c)). BR1's port 2 is still forwarding traffic, even though BR2's port 2 is disabled: BR1's port 2 and BR2's port 2 might be connected to a hub along with other hosts, and BR2's port 2 is needed to provide connectivity to those other hosts.

拓扑变化

Topology Changes

拓扑更改是更改 L2 网络上的系统或其端口连接方式的事件。当拓扑发生变化时,过去可以通过给定路径访问的地址现在可以通过不同的路径访问。因此,必须正确处理拓扑变化,以保持网络环路畅通并更新转发数据库。就图和树而言,当您添加或删除链接或选择不同的节点作为树的根时,拓扑会发生变化(请记住,生成树是根据树根的选择来计算的)。

A topology change is an event that changes which systems are on an L2 network, or how their ports are connected. When the topology changes, an address that used to be reachable through a given path may now be reachable through a different one. So a change in the topology must be handled properly to keep the network loop free and update the forwarding databases. In terms of graphs and trees, the topology changes when you add or remove a link, or select a different node as the tree's root (remember that the spanning tree is calculated based on the selection of the tree's root).

我们先看看触发拓扑变化的事件,然后看看它们是如何处理的:

Let's first see the events that trigger a topology change, and then how they are handled:

非转发网桥端口将状态更改为转发,反之亦然
A nonforwarding bridge port changes state to forwarding, or vice versa

这种情况包括启用已禁用的端口和仅由于协议决定而更改状态的端口。从树的角度来看,这相当于向树添加或删除链接。

This case includes a disabled port that is enabled and a port that simply changes state due to a protocol decision. From a tree's perspective, this is equivalent to adding a link to the tree or removing one.

根桥ID变化
The root bridge ID changes

例如，可能会发生这种情况，因为当前的根网桥已被关闭（因此另一个网桥接管了根角色），或者因为启用了一个更好的根网桥，或者因为当前的根网桥或另一个网桥更改了自己的优先级。根网桥的更改可能会在整个网络中触发相当多的端口状态和角色变化，具体取决于新根网桥相对于旧根网桥的位置。理论上，根节点的更改可以产生非常不同的树，但在实践中，网桥是使用我们在“网桥和端口 ID”一节中看到的参数进行配置的，从而使拓扑的更改不会涉及树的重大变化。

This can happen, for example, because the current root bridge has been shut down (and therefore another one has taken over the root role), or because a better one has been enabled, or because either the current root bridge or another bridge has changed its priority. A change of the root bridge can, depending on where the new root bridge is located with respect to the old one, trigger quite a few changes of port state and roles all over the network. In theory, a change of root node can produce a very different tree, but in practice, bridges are configured using the parameters we saw in the section "Bridge and Port IDs" so that topology changes do not involve major changes in the tree.

A TCN topology change is received on a bridge port

In this case, the topology change has been detected by another bridge. See the section "Letting All Bridges Know About a Topology Change."

Note that given a loop-free topology, you can create a loop only by adding a link (i.e., a new port enters the forwarding state), not by removing a link (i.e., a forwarding port changes its state to blocking). Removing a link can only partition the tree, whereas adding a link to a tree always creates a loop unless another port has been simultaneously disabled or blocked.

Short Aging Timer

I said earlier that when a topology change is detected, the forwarding database needs to be changed, too. Let's see why with an example. Let's suppose that the link between A2 and D1 in Figure 15-2(a) failed for some reason. All the hosts connected to the access bridge A2 would not be reachable anymore from D1, and D2 should be used instead. The STP will take care of updating the topology by making the bridges go through a configuration update (see the section "Defining the Active Topology"). The new topology could, for instance, look like Figure 15-17. The figure also shows, as an example, the new path between host 40 and host 11.

STP also will make sure to update the stale information in D2 that says that Hosts 11-20 are reachable via its port connected to C1. Stale information is actually not only in that bridge; the forwarding database of other bridges also needs to be cleaned up. Moreover, when there is a change in the topology, the STP needs to converge to a new loop-free topology. During that time, bridge ports may change role and state several times, and thus so will the contents of the forwarding databases.

Stale information in the forwarding database is cleaned up by reducing the time after which an address in the database is removed if it is not used. This is carried out by reducing the Aging timer, which is 5 minutes by default, to the Forward Delay (i.e., 15 seconds by default) when a bridge is notified about a topology change. Topology changes are notified by setting a special flag in the configuration BPDUs (see the next section).

Figure 15-17. Handling a root port failure on A2

Letting All Bridges Know About a Topology Change

When a topology change is detected by a bridge, all bridges must be notified so that they can start using short aging to clean up stale entries in their forwarding databases. Let's see how this is accomplished:

  1. The bridge notifies the root bridge about the topology change.

  2. The root bridge notifies all the bridges about the topology change.

The first step is done with TCN BPDUs. The bridge that detects the topology change sends a TCN BPDU to its designated bridge through the root port. The bridge sends a TCN BPDU at every Hello time until the designated bridge acknowledges its reception. The designated bridge acknowledges the reception of the TCN BPDU by setting the TCA flag in its next configuration BPDU. At this point, the designated bridge repeats the same process by sending a TCN BPDU to its designated bridge through the root port, etc. This process ends when the TCN finally makes it to the root bridge. The use of the TCN BPDUs is not needed when the topology change is detected by the root bridge itself (because the root bridge does not need to notify itself).

The second step is done by the root bridge by setting a special flag (TC) in its transmitted configuration BPDUs. This flag will be kept toggled on in the BPDUs regenerated by the nonroot bridges so that all bridges in the network will eventually receive the topology change notification. When a bridge sees this flag set, it starts the Short Aging timer.

Example of a Topology Change

If we take the scenario of Figure 15-2(a) and imagine shutting down the link between A2 and D1 (i.e., the root port), A2 would elect the other port as the root port, which would change the state from blocking to forwarding. This would lead to the new scenario in Figure 15-18(a).

Figure 15-18. Use of the TCN BPDU

A2 starts the TCN timer and transmits a TCN BPDU out of its (new) root port. When D2 receives the TCN BPDU, it acknowledges the reception by sending back a configuration BPDU with the TCA flag set, starts the TCN timer, and transmits a TCN BPDU out of its root port. When A2 receives the acknowledgment from D2, it stops its TCN timer. D2 will do the same when it receives the acknowledgment from C1.

When C1 receives the TCN BPDU, it starts the Topology Change timer, which will remain active for 35 seconds, and sets the TC flag on all BPDUs transmitted out while the timer is pending (see Figure 15-19). The 35 seconds used by the Topology Change timer is not a random value: it is the Forward Delay plus the Max Age (see Table 15-4).

The TC flag will be propagated down the entire tree because all bridges relay the flags received from the root bridge. When a bridge sees this flag set on an ingress BPDU, it starts using the Short Aging timer (if it has not done so already). Once the Topology Change timer expires on the root bridge, the latter stops setting the TC flags in its BPDUs. Upon receiving a BPDU with the TC flag cleared, a bridge stops using the Short Aging timer and starts using default aging.

Note that a bridge can receive configuration BPDUs with the TC flags set on different ports. For example, D2 in Figure 15-19 receives one from C1 and one from C2. This is not a problem: at any moment, the bridge is using either the default Aging timer or the Short Aging timer, so when a bridge already using the Short Aging timer receives a configuration BPDU with the TC flag, it does not need to do anything.

BPDU Encapsulation

The L2 multicast addresses in the range 01:80:C2:00:00:00 to 01:80:C2:00:00:FF are reserved by IEEE for standard protocols. In particular, the first address of the range, 01:80:C2:00:00:00, is used by the 802.1D STP: both configuration and TCN BPDUs are sent to this address. This address is what allows bridges to recognize BPDUs.

Figure 15-20 shows what the encapsulation of a BPDU inside an Ethernet frame looks like.

For more details on the LLC header, you can refer to Chapter 13.

Note that the same IEEE spec states that the addresses in the range 01:80:C2:00:00:00 to 01:80:C2:00:00:0F should not be relayed by a bridge running the 802.1D protocol: they are either processed locally by the destination protocol (if implemented and enabled) or dropped.

Figure 15-19. Notifying all bridges about the topology change

Figure 15-20. BPDU encapsulation

Transmitting Configuration BPDUs

We saw what conditions trigger the transmission of configuration BPDUs in the section "When to Transmit Configuration BPDUs." Regardless of why a configuration BPDU is transmitted, the logic of Figure 15-21 applies.

Figure 15-21. Configuration BPDU transmission logic

The per-port Hold timer enforces a rate limit of one BPDU per second. When a BPDU is transmitted, the timer is started. If another transmission is attempted and the timer is already pending, the BPDU is not transmitted and a flag is set in the bridge port configuration block. When the timer expires, it checks the flag and transmits a configuration BPDU if it finds the flag set.

When the root bridge transmits a BPDU, the timers are initialized to the values configured locally; otherwise, the ones received from the root bridge are used instead. Message age and root path cost are both 0 for the root bridge.

Also, the following is not shown in the figure:

  • When the bridge needs to acknowledge the reception of a TCN BPDU, it sets the TCA flag.

  • The root bridge sets the TC flag if the Topology Change timer is running.

  • Nonroot bridges set the TC flag if the last BPDU received on the root port had the TC flag set.

Processing Ingress Frames

We saw in Chapter 14 how a simple bridge handles ingress traffic. Let's now see how a bridge running the STP handles ingress traffic.

Ingress traffic now includes not only data traffic, but BPDUs as well. Bridges handle data traffic the same way, regardless of whether STP is enabled. The only difference is that ports blocked by STP cannot forward any data traffic because they are not considered part of the tree.

Ingress BPDUs

Unlike data traffic, ingress BPDUs are accepted on any port that has not been administratively disabled, including those in the blocking state.

Configuration BPDUs and TCN BPDUs can be distinguished thanks to the BPDU type field, as shown in Figure 15-7. In the section "Letting All Bridges Know About a Topology Change," we already saw how ingress TCN BPDUs are handled. In the next section, we will see how configuration BPDUs are processed.

Ingress Configuration BPDUs

Figure 15-22 shows how ingress configuration BPDUs are processed.

The handling of an ingress BPDU depends on whether its priority vector is:

Better than the one currently known to the receiving bridge's port

In this case, the BPDU triggers a configuration update that includes the new root port, the designated ports, and the new state for all ports.

Figure 15-22. Processing ingress configuration BPDUs

The same as the one already known to the receiving bridge's port

This is what would be received on the root port when the topology has already converged.

Worse than the one known to the receiving bridge's port

In this case, the bridge replies by sending a configuration BPDU with its own (better) information. This is a common case that happens when a new bridge is added to the topology: initially the bridge does not know anything about the other bridges and therefore advertises its information. It can also happen in numerous other cases, such as when a bridge configuration is changed.

When an ingress BPDU claims a better priority vector than the one known to the receiver port, there is one special case to handle: when the receiving bridge was the root bridge it must lay down its crown. As we mentioned in the section "Topology Changes," this is one of the events that is considered a topology change. In such a case, the bridge that lost the root role must stop the Hello timer (because it is to be run only on the root bridge), send a TCN BPDU out its root port toward the new root bridge, and start the TCN timer to notify the root bridge about the topology change (which will take care of notifying all other bridges).

When the BPDU is received on the root port, the bridge saves the timers from the BPDU (which it will use in its egress BPDUs) and transmits a configuration BPDU out all of its designated ports. When the TCA flag is set, the TCN timer can be stopped.

Convergence Time

We have seen how the STP dynamically updates the topology of the tree based on configuration changes and link or bridge failures. Let's see now how much time STP needs to detect common failures and react accordingly.

When a configuration update takes place in a complex scenario, the network may require minutes before it converges and stabilizes.[*] During that time the topology is still loop free, but it may not be able to carry traffic properly (because the topology is still changing while the traffic is in transit). In those setups, it is not possible to predict exactly how the topology evolves toward a new stable tree, because the timing of BPDU receptions and transmissions depends on several factors, such as how loaded the bridges are at that moment.

However, no matter how well you configure the bridges, there are minimum latencies that cannot be eliminated or reduced. For example:

  • When a port changes state, moving, for example, from blocking to forwarding to replace a failing bridge port, the transition to forwarding is not immediate, but takes twice the time of the Forward Delay timer (i.e., 30 seconds by default), as shown in Figure 15-6. The port cannot forward any data traffic during this time.

  • Root and nondesignated ports (i.e., the ones that receive BPDUs) realize that they have lost connection to their designated bridge (and therefore to the entire tree except for the portions below the bridge's designated ports[*]) only when their Message Age timer expires. For example, if the C1 port that goes to D1 in Figure 15-2(a) failed for some reason, D1 may come to know it only after 20 seconds when the Max Age timer expires.

Note that both of these cases are driven by timers. Of course, you can configure both the Forward Delay and Max Age timers to expire faster and therefore reduce the convergence time. However, depending on how complex the network is, you may not always be able to use timers that are too aggressive.

Let's see an example based on Figure 15-2 and see what happens when the D1 bridge fails for some reason. Because both A1 and A2 use D1 to access the rest of the network, all the hosts connected to A1 and A2 are isolated from the rest of the network until STP manages to select a new root port for both A1 and A2. So, how long would it take for STP to make such changes? In the worst-case scenario, this is what would happen:

  • D1 stops working properly.

  • After 20 seconds, the Message Age timer expires in both A1 and A2 root ports.

  • A1 and A2 select the port that goes to D2 as the new root port.

  • After another 30 seconds, those new root ports enter the forwarding state.

This means a potential black hole of 50 seconds for the hosts connected to A1 and A2. In a complex network, the topology may require more time than this to converge, even a few minutes.

Overview of Newer Spanning Tree Protocols

The convergence time of the 802.1D-1998 STP was acceptable when it was first defined several years ago by the IEEE committee. However, it has proven too slow over the years, given the higher availability requirements of newer network applications, such as interactive multimedia (IP telephony, video conferencing, etc.), not to mention the user expectations that continuously grow.

To address this issue, various commercial bridge producers came out with proprietary enhancements to the STP. Unfortunately, proprietary enhancements often cannot be used in heterogeneous networks that employ devices from different vendors.

Recently the IEEE came out with two newer protocols, Rapid Spanning Tree (RSTP ) and Multiple Spanning Tree (MSTP), that address all the significant shortcomings currently known in their older brother, 802.1D. We cannot describe the two protocols here because they would require quite some space (especially MSTP), and anyway, none of them is implemented in Linux (yet). In the following two subsections, we will just explore some of the main improvements offered by the new protocols. For more detail, I suggest the following documents:

Rapid Spanning Tree Protocol (RSTP)

The RSTP is backward compatible with 802.1D, so bridges running older and newer protocols can interoperate without problems. However, not all the enhancements introduced by the RSTP would be enabled in such heterogeneous environments. Here are some of the enhancements:

  • Each bridge port is now assigned a role. Ports that are neither designated nor root are assigned either the alternate or the backup role. Alternate is used for ports that represent alternate paths toward the root bridge (potential replacements for the current root path), and backup is used for ports that represent alternate paths to the subtree (potential replacements for the designated port). For example:

    • In Figure 15-10, BR2's port 2 would be an alternate port because it provides an alternate path to the root bridge.

    • In Figure 15-11(c), BR1's port 2 would be a backup port because it represents a candidate designated port, but another port on the same bridge has a better priority vector.

  • The possible states a port can be assigned have been simplified: the new discarding state includes the old disabled, blocking, and listening states.

  • RSTP is able to transition a port to the forwarding state much faster, by means of handshakes between ports and a mechanism called sync that makes sure loops are avoided. This new and interesting improvement in RSTP is effective only on point-to-point links.

  • When the root port is replaced, it can go to the forwarding state immediately (i.e., no need to wait two times for the Forward Delay timer to expire, as shown in Figure 15-6). This is possible because the protocol has a mechanism to ensure that this immediate transition to forwarding will not cause any loop.

  • The previous two enhancements alone imply a significantly faster convergence time, perhaps even a subsecond (depending on the complexity of the topology).

  • All bridges now run the Hello timer and generate configuration BPDUs independently. This allows faster detection of connectivity problems.

  • The detection of connectivity problems is no longer dependent on the Max Age timer. Now, when a port that is supposed to receive BPDUs does not receive them for three Hello time periods in a row, it starts the recovery mechanism. The old Max Age timer is still there, but is used only when the new RSTP procedure, previously mentioned, is not applicable.

  • Topology changes are handled differently, too. There is no need for TCN BPDUs anymore. Now the bridge that detects a topology change simply transmits a BPDU with the TC flag set from its root and designated ports. Every other bridge that receives such a BPDU simply repeats the process: it transmits a BPDU with the TC flag set out of each of its forwarding ports, except the one from which it received the original. This is a simple mechanism for spreading the topology change notification in all directions. When a bridge receives a BPDU with the TC flag set, it does not start the Short Aging timer, but instead flushes the addresses learned on all its ports, except the one from which the BPDU was received.

  • The structure of the BPDUs used by RSTP has changed very little compared to STP. The flags field now uses all 8 bits to accommodate the needs of the newer enhancements.

At the time this chapter was written, there was no open source implementation of RSTP available for Linux. However, you can find a user-space simulator on SourceForge (http://rstplib.sourceforge.net).

Multiple Spanning Tree Protocol (MSTP)

MSTP was designed more or less at the same time RSTP was defined. The main enhancement it introduces is the possibility of defining multiple independent spanning trees. Each spanning tree carries its own subset of data traffic.

The selection of the spanning tree to use for each data packet is based on the VLAN where the packet originated. Figure 15-23 shows hosts configured on two different groups of VLANs, and for each group of VLANs the MSTP builds a separate spanning tree.

Figure 15-23. Example of bridged network that defines two spanning tree instances

From the hosts' perspective, the result is Figure 15-24.

Nowadays it is pretty common to use VLANs on bridged networks. It's a convenient way to create different L2 broadcast domains on bridged networks. The possibility of defining multiple spanning tree topologies on the same network has several advantages. Among them is better use of the network bandwidth (i.e., better use of redundant links—that is, load balancing). This translates to a lower load on each link.

The MSTP uses the RSTP for each of its spanning tree instances. The protocol is actually more complex than it may seem: the different STP instances are independent, but there is one special instance that plays a central role in the protocol, especially with regard to how BPDUs are exchanged and how backward compatibility with previous protocols is maintained.

Figure 15-24. The two spanning trees in Figure 15-23

At the time this book was written, there was no open source implementation of the MSTP.




[*] 12-digit MAC addresses (such as 11:22:33:44:55:66) are replaced with simple 2-digit values to make the figure more readable.

[*] We will see that the STP defines a mechanism to deterministically choose the same tree every time, if more than one optimal instance is available.

[*] The links in the figure are not assigned costs. You can assume their costs to be 1, and therefore the path cost to be the hop count. This is just an example; 1 is not the default cost assigned to links.

[*] If you use shared media such as hubs to connect bridges, as in Figure 15-11(c), the root bridge can have nondesignated ports as well. The newer RSTP protocol would call that port a backup port (see the section "Rapid Spanning Tree Protocol (RSTP)").

[*] Remember that the path cost of any link is a locally configured parameter and it is not carried in the configuration BPDUs.

[*] If you are interested in how the default timers have been defined, read either the IEEE 802.1d specification or http://www.cisco.com/warp/public/473/122.html.

[*] It is possible to configure the bridges to reduce the impact of a failure or configuration change and thus contribute to faster convergence.

[*] Thus, if the bridge is a leaf in the tree (i.e., an access bridge), it is completely isolated.

Chapter 16. Bridging: Linux Implementation

This chapter moves on from the general discussion of the bridging specifications and protocols to show how Linux does the job.

We saw in Chapter 10 how the bridging code can capture ingress packets in netif_receive_skb. In this chapter, we will see exactly how those ingress packets are processed. We will see how the bridging code manipulates device states and processes ingress traffic, both when the STP is enabled and when it is not.

For a performance evaluation of the bridging code, please refer to the paper "Performance Evaluation of Linux Bridge" by James T. Yu, which you can find with a web search.

Bridge Device Abstraction

In Linux, a bridge is a virtual device. As such, it cannot receive or transmit anything unless you bind one or more real devices to it. We will use the term enslave to refer to the process of binding a real device to a (virtual) bridge device.

Let's suppose we want to implement the topology of Figure 16-1. A few points in the figure deserve emphasis:

  • The bridge merges two LANs. The hosts of LAN1 and LAN2 are configured on the same subnet, 10.0.1.0/24.

  • The bridge is connected to a router so that the hosts of LAN1 and LAN2 can communicate with the hosts of LAN3.

  • From the router's perspective, there is a single LAN on eth0.

Because Linux implements both routing and bridging, we can merge the two devices into a single Linux system and obtain something like the topology in Figure 16-2(a). The network connection between the bridge and the router is internal to the kernel there.

Figure 16-1. Example of use of a bridge

Figure 16-2. Bridge device abstraction

Now the kernel must be able to handle the following two issues:

  • At the router level, it sees only two subnets (10.0.1.0/24, 10.0.2.0/24), even though there are three interfaces (eth0, eth1, eth2).

  • It should bridge only between eth0 and eth1, and consider the two interfaces as configured on the same IP subnet.

These two issues are handled independently and elegantly, thanks to the way the bridge device is abstracted.

When you create a bridge device, you must tell the kernel which interface to enslave to it. In other words, you must tell the kernel which interfaces to bridge. Sticking to our example, we would create a bridge device, let's call it br0, and assign eth0 and eth1 to it. Because eth0 and eth1 are the bridge interfaces, they do not need any IP configuration—they don't need to be seen at the L3 layer at all, just as none of the bridge interfaces in Figure 16-1 had any IP configuration. Instead, you assign to the bridge device the IP configuration that the router's link to the bridge had in Figure 16-1. The result is the configuration in Figure 16-2(b).
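With the user-space bridge-utils tools, a setup along these lines could be configured as follows. This is a sketch: the interface names, the 10.0.1.1 address, and the use of brctl/ifconfig are assumptions based on the example, and the exact commands depend on your distribution and tool versions.

```shell
# Requires root and the bridge kernel module; brctl is in bridge-utils.
brctl addbr br0            # create the (virtual) bridge device
brctl addif br0 eth0       # enslave the two LAN-facing interfaces
brctl addif br0 eth1
ifconfig eth0 0.0.0.0 up   # enslaved ports need no IP configuration
ifconfig eth1 0.0.0.0 up
ifconfig br0 10.0.1.1 netmask 255.255.255.0 up  # IP config goes on br0
brctl stp br0 on           # optionally enable STP on the bridge
```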

此时,路由子系统可以根据 eth2br0上配置的子网进行路由。当尝试在br0上进行传输时,桥接设备驱动程序会应用我们在第 14 章中看到的逻辑来管理从属设备:如果转发数据库知道目标 MAC 地址所在的位置,则该帧仅在正确的桥接端口上传输;否则,将被洪泛到桥设备的所有桥端口。

At this point, the routing subsystem can route based on the subnets configured on eth2 and br0. When a transmission on br0 is attempted, the bridge device driver manages the enslaved devices, applying the logic we saw in Chapter 14: if the forwarding database knows where the destination MAC address is located, the frame is transmitted only on the right bridge port; otherwise, it is flooded to all bridge ports of the bridge device.
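
The forward-or-flood decision can be sketched in user space with a toy forwarding database. This is a simplified model, not the kernel's implementation: the real database is a hash table of net_bridge_fdb_entry structures, while here a flat array and linear search stand in for it, and all names are invented for the sketch:

```c
#include <assert.h>
#include <string.h>

#define FDB_SIZE 16

/* One slot per learned MAC address; port < 0 marks a free slot. */
struct toy_fdb_entry {
    unsigned char mac[6];
    int port;
};

static struct toy_fdb_entry toy_fdb[FDB_SIZE];

static void toy_fdb_init(void)
{
    for (int i = 0; i < FDB_SIZE; i++)
        toy_fdb[i].port = -1;
}

/* Record (or refresh) the port on which a source MAC was seen. */
static void toy_fdb_learn(const unsigned char *mac, int port)
{
    for (int i = 0; i < FDB_SIZE; i++) {
        if (toy_fdb[i].port < 0 || !memcmp(toy_fdb[i].mac, mac, 6)) {
            memcpy(toy_fdb[i].mac, mac, 6);
            toy_fdb[i].port = port;
            return;
        }
    }
}

/* Returns the egress port for a destination MAC, or -1 to request
 * flooding on all bridge ports. */
static int toy_fdb_lookup(const unsigned char *mac)
{
    for (int i = 0; i < FDB_SIZE; i++)
        if (toy_fdb[i].port >= 0 && !memcmp(toy_fdb[i].mac, mac, 6))
            return toy_fdb[i].port;
    return -1;
}
```

A known destination yields a single egress port; an unknown one yields -1, the flooding case.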

我们在第 11 章中看到,设备上的传输是通过 dev_queue_xmit 完成的。图 16-3(a) 显示了示例中可以要求 dev_queue_xmit 传输帧的设备。dev_queue_xmit 会调用设备驱动程序提供的 hard_start_xmit 例程。桥接设备驱动程序使用的函数会查阅转发数据库并选择正确的出口设备,或在必要时使用泛洪。详细信息将在稍后的“总体情况”一节中给出。

We saw in Chapter 11 that transmissions on a device are done with dev_queue_xmit. Figure 16-3(a) shows the devices in our example on which dev_queue_xmit can be asked to transmit frames. dev_queue_xmit invokes the hard_start_xmit routine provided by the device's driver. The function used by the bridging device driver consults the forwarding database and selects the right egress device, or uses flooding if necessary. Details will be provided later in the section "The Big Picture."

我们在第 10 章中看到了处理入口帧的设备驱动程序如何首先初始化 sk_buff 结构的几个字段,然后将其传递到上层。这些字段之一表示接收该帧的设备。但是,NIC 设备驱动程序对桥接一无所知,因此它无法将入口帧分配给与网桥关联的 net_device 实例。在“处理数据帧”一节中,您将看到如何解决此问题。

We saw in Chapter 10 how a device driver that processes an ingress frame first initializes a few fields of the sk_buff structure and then passes it to the upper layer. One of those fields represents the device on which the frame is received. However, the NIC device driver does not know anything about bridging, so it can't assign an ingress frame to the net_device instance associated with a bridge. In the section "Processing Data Frames," you will see how this issue is taken care of.

从属于桥接设备的设备也可以有自己的 IP 配置。例如,给定图 16-2 中的拓扑,如果我们再添加一个子网 (10.0.3.0/24) 并在其上为 eth0 配置一个地址,则 Linux 内核路由层将看到如图 16-4 所示的拓扑。

The devices enslaved to a bridge device can have their own IP configuration, too. For example, given the topology in Figure 16-2, if we added one more subnet (10.0.3.0/24) and configured eth0 with an address on it, the topology would appear to the Linux kernel routing layer like the one in Figure 16-4.

因此, eth0可以接收发送至br0网桥设备或其自身的流量。这意味着图16-3中的模型将变为图16-5中的模型。

eth0 can therefore receive traffic addressed to either the br0 bridge device or itself. This means that the model in Figure 16-3 changes to that in Figure 16-5.

虽然发送部分不需要对代码进行任何调整,但接收部分则需要。默认情况下,从属设备上收到的流量将分配给其指定的桥接设备。例如,在图 16-5中,在eth1上接收到的帧将被分配给br0。是否桥接或路由入口帧的决定(即,在前面的示例中决定是否将入口帧交给eth0还是br0 )可以使用 ebtables 进行配置(请参阅“数据帧与 BPDU ”部分)。

While the transmitting part does not require any tweaking in the code, the receiving part does. By default, traffic received on an enslaved device is assigned to its assigned bridge device. For example, in Figure 16-5, a frame received on eth1 would be assigned to br0. The decision whether to bridge or route an ingress frame (i.e., the decision whether to hand ingress frames to eth0 or br0 in the previous example) can be configured with ebtables (see the section "Data Frames Versus BPDUs").

(a) 在桥接设备上传输; (b) 在桥接设备上接收

图 16-3。(a) 在桥接设备上传输;(b) 在桥接设备上接收

Figure 16-3. (a) Transmitting on a bridge device; (b) receiving on a bridge device

将 L3 配置分配给从属 NIC

图 16-4。将 L3 配置分配给从属 NIC

Figure 16-4. Assigning an L3 configuration to an enslaved NIC

使用 NIC 作为独立接口和桥接端口

图 16-5。使用 NIC 作为独立接口和桥接端口

Figure 16-5. Using an NIC both as a standalone interface and as a bridge port

重要的数据结构

Important Data Structures

下面的列表解释了桥接代码定义和使用的主要数据结构。所有这些在第 17 章中都有专门的章节,其中逐个字段进行了描述。

The following list explains the main data structures defined and used by the bridging code. All of them have dedicated sections with field-by-field descriptions in Chapter 17.

mac_addr
mac_addr

MAC地址。

MAC address.

bridge_id
bridge_id

网桥 ID(第 15 章中定义)。

Bridge ID (defined in Chapter 15).

net_bridge_fdb_entry
net_bridge_fdb_entry

转发数据库的条目。网桥获知的每个 MAC 地址都有一个。

Entry of the forwarding database. There is one for each MAC address learned by the bridge.

net_bridge_port
net_bridge_port

网桥端口。

Bridge port.

net_bridge
net_bridge

适用于单个网桥的信息。如图16-6所示,该结构被附加到net_device数据结构中。与大多数虚拟设备一样,它包含只有虚拟设备代码(在本例中为桥接)才能理解的私有信息。

Information applying to a single bridge. As shown in Figure 16-6, this structure is appended to a net_device data structure. As with most virtual devices, it includes private information understood only by the virtual device code—bridging, in this case.

br_config_bpdu
br_config_bpdu

入口配置 BPDU 的关键字段被复制到此数据结构中,并且它代替原始 BPDU 被传递到处理配置 BPDU 的例程。

The key fields of an ingress configuration BPDU are copied into this data structure, and it is passed instead of the original BPDU to the routine that processes configuration BPDUs.

所有数据结构都在 net/bridge/br_private.h 中定义,但 br_config_bpdu 除外,它在 net/bridge/br_private_stp.h 中定义。图 16-6 显示了其中一些数据结构之间的关系。该图不反映上一节中看到的任何配置示例。

All data structures are defined in net/bridge/br_private.h, with the exception of br_config_bpdu, which is defined in net/bridge/br_private_stp.h. Figure 16-6 shows the relationships between some of these data structures. The figure does not reflect any of the examples of configurations seen in the previous section.

age_list列表不再使用;我将其包含在图中仅供参考。请参见第 17 章中的“ net_bridge 结构”部分。

The age_list list is not used anymore; I included it in the figure only for reference. See the section "net_bridge Structure" in Chapter 17.
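
The pointer relationships of Figure 16-6 can be summarized with a trimmed-down user-space sketch. The field sets are reduced to the bare minimum and the names carry a _sketch suffix to stress that these are not the kernel definitions from net/bridge/br_private.h:

```c
#include <assert.h>
#include <stddef.h>

/* Each port points back to both its bridge and the enslaved device,
 * and the bridge keeps its ports on a singly linked list. */
struct net_device_sketch {
    char name[16];
    struct net_bridge_port_sketch *br_port; /* NULL if not enslaved */
};

struct net_bridge_port_sketch {
    struct net_bridge_sketch      *br;      /* owning bridge */
    struct net_device_sketch      *dev;     /* enslaved device */
    int                            port_no;
    struct net_bridge_port_sketch *next;    /* bridge's port list */
};

struct net_bridge_sketch {
    struct net_bridge_port_sketch *port_list;
    struct net_device_sketch      *dev;     /* bridge's own net_device */
};

static void enslave(struct net_bridge_sketch *br,
                    struct net_bridge_port_sketch *p,
                    struct net_device_sketch *dev, int port_no)
{
    p->br = br;
    p->dev = dev;
    p->port_no = port_no;
    p->next = br->port_list;
    br->port_list = p;
    dev->br_port = p;   /* lets ingress code find the bridge */
}
```

The dev->br_port back-pointer is what lets ingress code discover that a frame arrived on an enslaved device.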

桥接代码的初始化

Initialization of Bridging Code

桥接代码可以内置到内核中,也可以编译为模块。初始化和清理例程分别是 br_init 和 br_deinit,它们定义在 net/bridge/br.c 中。

The bridging code can be either built into the kernel or compiled as a module. The initialization and cleanup routines, br_init and br_deinit, respectively, are defined in net/bridge/br.c.

初始化包括:

Initialization consists of:

  • 通过创建用于分配net_bridge_fdb_entry结构的slab缓存(内存区域)来初始化转发数据库(br_fdb_init)。

  • Initializing the forwarding database by creating a slab cache (a memory area) to use for allocating net_bridge_fdb_entry structures (br_fdb_init).

  • 初始化函数指针 br_ioctl_hook,使其指向处理 ioctl 命令的例程。ioctl 命令在第 17 章中描述。

  • Initializing the function pointer br_ioctl_hook to the routine that will take care of ioctl commands. ioctl commands are described in Chapter 17.

  • br_handle_frame_hook初始化指向将处理入口 BPDU 的例程的函数指针。请参阅“处理入口流量”部分。

  • Initializing the function pointer br_handle_frame_hook to the routine that will process ingress BPDUs. See the section "Handling Ingress Traffic."

  • 向通知链注册回调netdev_chain请参阅“ netdevice 通知链”部分。

  • Registering a callback with the netdev_chain notification chain. See the section "netdevice Notification Chain."

当内核编译为支持桥接防火墙时,该选项在这里用 初始化br_netfilter_init。稍后,在图 16-11 的“总体情况”部分中,您可以看到所有 Netfilter 挂钩位于桥接代码用来处理入口和出口流量的核心例程中的位置。

When the kernel is compiled with support for Bridging-Firewalling, the option is initialized here with br_netfilter_init. Later, in Figure 16-11 in the section "The Big Picture," you can see where all the Netfilter hooks are located in the core routines used by the bridging code to process ingress and egress traffic.

主要数据结构类型之间的关系

图 16-6。主要数据结构类型之间的关系

Figure 16-6. Relationships between the main data structure types

桥接防火墙通过选项“网络支持 → 网络选项 → 网络数据包过滤(取代 ipchains)→ 桥接 IP/ARP 数据包过滤”选项添加到内核中。以太网桥接表选项(即 ebtables)在其他地方初始化(请参阅“数据帧与 BPDU ”部分)。

Bridging-Firewalling is added to the kernel with the option "Networking support → Networking options → Network packet filtering (replaces ipchains) → Bridged IP/ARP packet filtering". The Ethernet-Bridging-Tables option (i.e., ebtables) is initialized elsewhere (see the section "Data Frames Versus BPDUs").

清理例程br_deinit只是撤消 所做的事情br_init

The cleanup routine br_deinit simply undoes what was done by br_init.

创建桥接设备和桥接端口

Creating Bridge Devices and Bridge Ports

管理员可以创建的桥接设备的数量没有硬性限制。每个桥接设备最多可以有BR_MAX_PORTS(1,024) 个端口。

There is no hard limit to the number of bridge devices an administrator can create. Each bridge device can have up to BR_MAX_PORTS (1,024) ports.

br_add_bridge桥接设备分别用和来创建和删除br_del_bridge

Bridge devices are created and removed, respectively, with br_add_bridge and br_del_bridge.

使用 可以将端口添加到桥接设备,br_add_if 也可以使用 来将端口删除br_del_if

Ports are added to a bridge device with br_add_if and are removed with br_del_if.

所有四个例程都在持有 Netlink 路由锁的情况下运行。该锁通过 rtnl_lock 获取,通过 rtnl_unlock 释放。br_add_bridge 和 br_del_bridge 自行处理加锁;对于 br_add_if 和 br_del_if,则由 dev_ioctl 函数负责(参见第 17 章)。

All four routines run with the Netlink routing lock held. The lock is acquired with rtnl_lock and is released with rtnl_unlock. br_add_bridge and br_del_bridge take care of locking on their own. For br_add_if and br_del_if, the dev_ioctl function takes care of it (see Chapter 17).

所有四个 br_ 例程均在 net/bridge/br_if.c 中定义。在第 17 章中,您可以了解如何调用它们来响应用户空间中的配置命令。

All four br_ routines are defined in net/bridge/br_if.c. In Chapter 17, you can learn how they are invoked in response to configuration commands in user space.

创建新的桥接设备

Creating a New Bridge Device

尽管桥接设备是虚拟的,第 8 章中有关如何启用、禁用、注册和取消注册设备的讨论仍然适用。我们鼓励您在阅读本节时使用该章作为参考。

Even though bridge devices are virtual, the discussion about how a device is enabled, disabled, registered, and unregistered in Chapter 8 still applies. You are encouraged to use that chapter as a reference when reading this section.

桥接设备的创建和注册遵循第 8 章中描述的模型。唯一的区别是,因为它是一个虚拟设备,网桥需要在其私有区域(即图 16-6 底部的 net_bridge 数据结构)进行额外的初始化。最后一项任务由 new_bridge_dev 负责,它:

The creation and registration of a bridge device follows the model described in Chapter 8. The only difference is that because it is a virtual device, a bridge needs extra initializations in its private area (i.e., the net_bridge data structure at the bottom of Figure 16-6). This last task is taken care of by new_bridge_dev, which:

  • 使用 br_dev_setup 作为设置例程,分配并初始化一个 net_device 数据结构。请参阅“桥接设备设置例程”一节。

  • Allocates and initializes a net_device data structure using br_dev_setup as the setup routine. See the section "Bridge Device Setup Routine."

  • 初始化私有结构体,如图16-6所示。

  • Initializes the private structure, as shown in Figure 16-6.

  • 将网桥优先级初始化为默认值 32,768 (0x8000)。

  • Initializes the bridge priority to the default value, 32,768 (0x8000).

  • 使用其标识符初始化指定网桥 ID,将根路径成本初始化为 0,将根端口初始化为 0(即无根端口)。这是因为当网桥首次启用时,它认为自己是根网桥。请参阅第 15 章中的“根桥选择”部分。

  • Initializes the designated bridge ID with its identifier, the root path cost to 0, and the root port to 0 (i.e., no root port). This is because when a bridge is first enabled, it believes itself to be the root bridge. See the section "Root Bridge Selection" in Chapter 15.

  • 将老化时间初始化为默认值 5 分钟。

  • Initializes the aging time to the default of 5 minutes.

  • 使用 初始化每个桥定时器br_stp_timer_init

  • Initializes the per-bridge timers with br_stp_timer_init.

请注意,无论网桥是否启用 STP,都会执行生成树参数的初始化。

Note that the initialization of spanning tree parameters is done regardless of whether the STP is enabled for the bridge.

可以为桥接设备指定任何名称。常见的是brNstpN,分别表示禁用和启用生成树协议。例如,如果您定义两个不使用 STP 的网桥,则通常将它们称为br1br2。然而,你的狗的名字也会被接受。

Bridge devices can be assigned any name. Common ones are brN and stpN, when the Spanning Tree Protocol is disabled and enabled, respectively. For example, if you define two bridges that don't use STP, you would conventionally call them br1 and br2. However, your dog's name would be accepted, too.

与任何其他网络设备一样,网桥在/sys/class/net/中分配了一个目录。请参阅第 17 章中的“通过 /sys 文件系统进行调整”部分。

As with any other network device, bridges are assigned a directory in /sys/class/net/. See the section "Tuning via /sys Filesystem" in Chapter 17.

桥接设备设置例程

Bridge Device Setup Routine

有关设备驱动程序在初始化其设备时如何使用设置例程的详细信息,请参阅第 8 章“设备类型初始化:xxx_setup 函数”一节。桥接设备使用 br_dev_setup 作为设置例程。下面的代码片段显示了其中值得关注的部分:

Details about how device drivers use the setup routines when initializing their devices can be found in the section "Device Type Initialization: xxx_setup Functions" in Chapter 8. Bridge devices use the br_dev_setup setup routine. The following snapshot shows the interesting part:

void br_dev_setup(struct net_device *dev)
{
    memset(dev->dev_addr, 0 , ETH_ALEN);
    ether_setup(dev);
    ...
    dev->do_ioctl = br_dev_ioctl;
    dev->hard_start_xmit = br_dev_xmit;
    dev->open = br_dev_open;
    dev->change_mtu = br_change_mtu;
    dev->stop = br_dev_stop;
    dev->tx_queue_len = 0 ;
    dev->set_mac_addr = NULL ;
    dev->priv_flags = IFF_EBRIDGE;
}
void br_dev_setup(struct net_device *dev)
{
    memset(dev->dev_addr, 0 , ETH_ALEN);
    ether_setup(dev);
    ...
    dev->do_ioctl = br_dev_ioctl;
    dev->hard_start_xmit = br_dev_xmit;
    dev->open = br_dev_open;
    dev->change_mtu = br_change_mtu;
    dev->stop = br_dev_stop;
    dev->tx_queue_len = 0 ;
    dev->set_mac_addr = NULL ;
    dev->priv_flags = IFF_EBRIDGE;
}

桥接设备默认不实现排队。它们让从属设备来处理排队,这解释了为什么 tx_queue_len 被初始化为 0。但是,管理员可以使用 ifconfig 或 ip link 来配置 tx_queue_len。

Bridge devices do not implement queuing by default. They let their enslaved devices take care of it, which explains why tx_queue_len is initialized to 0. However, the administrator can configure tx_queue_len with ifconfig or ip link.
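
For example, the default of 0 could be overridden on a hypothetical bridge br0 with either tool. The commands are illustrative configuration only; the value 100 is an arbitrary choice:

```shell
# iproute2 form
ip link set dev br0 txqueuelen 100

# legacy net-tools form
ifconfig br0 txqueuelen 100
```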

当桥接设备上的最大传输单元 (MTU) 更改时,内核必须确保新值不大于从属设备中的最小 MTU 值。这是由 br_change_mtu 保证的。

When the Maximum Transmission Unit (MTU) on a bridge device is changed, the kernel must ensure that the new value is no bigger than the smallest MTU value among the enslaved devices. This is ensured by br_change_mtu.
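
The constraint enforced by br_change_mtu can be restated as a small user-space sketch. The function names are invented for illustration and do not match the kernel code:

```c
#include <assert.h>

/* Smallest MTU among the enslaved devices; with no enslaved devices
 * the sketch falls back to the Ethernet default of 1500. */
static int min_port_mtu(const int *port_mtus, int n_ports)
{
    int min = 1500;
    for (int i = 0; i < n_ports; i++)
        if (port_mtus[i] < min)
            min = port_mtus[i];
    return min;
}

/* Mimics the check br_change_mtu performs: reject any new MTU larger
 * than the smallest enslaved-device MTU (the kernel returns an error
 * code in that case). */
static int change_bridge_mtu(const int *port_mtus, int n_ports, int new_mtu)
{
    if (new_mtu > min_port_mtu(port_mtus, n_ports))
        return -1;
    return 0;
}
```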

网桥的 MAC 地址 dev_addr 被清除,因为它将通过 br_stp_recalculate_bridge_id 从其从属设备上配置的 MAC 地址派生(请参阅“网桥 ID 和端口 ID”一节)。出于同样的原因,驱动程序不提供 set_mac_addr 函数。

The bridge MAC address dev_addr is cleared because it will be derived from the MAC addresses configured on its enslaved devices with br_stp_recalculate_bridge_id (see the section "Bridge IDs and Port IDs"). For the same reason, the driver does not provide a set_mac_addr function.

设置该IFF_EBRIDGE标志是为了内核代码可以在需要时区分桥接设备和其他类型的设备。

The IFF_EBRIDGE flag is set so that kernel code can distinguish bridge devices from other types of devices when needed.

br_dev_ioctl例程处理您可以在桥接设备上发出的一些 ioctl命令。参见 第 17 章

The br_dev_ioctl routine processes some of the ioctl commands you can issue on bridge devices. See Chapter 17.

我们在第 11 章中看到,驱动程序将函数指针初始化hard_start_xmit为它们用于传输的例程。桥接驱动程序将其初始化为br_dev_xmit. 该函数负责实现我们在“桥设备抽象”部分中看到的桥设备抽象。本章后面的图 16-11显示了如何实现该抽象。

We saw in Chapter 11 that drivers initialize the hard_start_xmit function pointer to the routine they use for transmission. The bridging driver initializes it to br_dev_xmit. This function is responsible for implementing the bridge device abstraction we saw in the section "Bridge Device Abstraction." Figure 16-11 later in this chapter shows how that abstraction is implemented.

当以管理方式启用或禁用桥接设备时,内核会分别调用 dev_open 和 dev_close,它们会为桥接设备调用 br_dev_open 和 br_dev_close。请参阅“启用和禁用桥接设备”一节。

When a bridge device is administratively enabled or disabled, the kernel calls dev_open and dev_close, respectively, which invokes br_dev_open and br_dev_close for bridge devices. See the section "Enabling and Disabling a Bridge Device."

删除桥

Deleting a Bridge

在移除桥接设备之前,必须先将其关闭。如果尚未关闭,br_del_bridge 将返回 -EBUSY 并拒绝移除设备。要移除它,br_del_bridge 会调用 del_br,后者完成大部分工作,如下所示:

Before a bridge device can be removed, it must be shut down. If it hasn't been shut down, br_del_bridge returns -EBUSY and refuses to remove the device. To remove it, br_del_bridge invokes del_br, which does most of the work, as follows:

  • 删除其所有桥接端口。对于每个桥接端口,它还会删除/sys中的关联链接(显示为目录)。请参阅“删除桥接端口”部分。

  • Removes all its bridge ports. For each bridge port, it also removes the associated links (which appear as directories) in /sys. See the section "Deleting a Bridge Port."

  • 对于每个端口,使用 br_fdb_delete_by_port 删除转发数据库中的所有关联条目,停止该端口的所有计时器,并递减混杂计数器。(混杂计数器将在接下来的“将端口添加到网桥”一节中描述。)

  • For each port, removes all the associated entries in the forwarding database with br_fdb_delete_by_port, stops all the port's timers, and decrements the promiscuity counter. (The promiscuity counter is described in the upcoming section, "Adding Ports to a Bridge.")

  • 停止垃圾收集计时器br->gc_timer

  • Stops the garbage collection timer br->gc_timer.

  • 使用 br_sysfs_delbr 删除 /sys/class/net 目录中的桥接设备目录。

  • Removes the bridge device directory in the /sys/class/net directory with br_sysfs_delbr.

  • 使用 unregister_netdevice 取消注册设备。该函数在第 8 章中描述。

  • Unregisters the device with unregister_netdevice. This function is described in Chapter 8.

将端口添加到网桥

Adding Ports to a Bridge

在目前的桥接实现中,网卡和桥接端口之间是一对一的关系,如图16-6所示。一些商业网桥允许管理员将 NIC 添加到多个网桥设备,并根据用户选择的标准将流量分配到特定网桥,但 Linux 不允许。

In the current implementation of bridging, there is a one-to-one relationship between NICs and bridge ports, as shown in Figure 16-6. Some commercial bridges allow an administrator to add an NIC to multiple bridge devices and assign traffic to a particular bridge based on user-chosen criteria, but Linux does not.

桥接端口通过 br_add_if 添加到桥接设备。该例程的内部结构如图 16-7 所示。该例程不关心桥接设备上是否启用了 STP。

Bridge ports are added to a bridge device with br_add_if. The routine internals are shown in Figure 16-7. The routine does not care whether the STP is enabled on the bridge device.

该例程从一组健全性检查开始。如果满足以下任一条件,则操作中止:

The routine starts with a set of sanity checks. The operation is aborted if any of the following conditions is met:

  • 与端口关联的设备不是以太网设备(或环回设备)。

  • The device to be associated with the port is not an Ethernet device (or the loopback device).

  • 与端口关联的设备是网桥。如图16-6所示,桥接端口必须分配给真实设备(或非桥接设备的虚拟设备)。

  • The device to be associated with the port is a bridge. As you can see in Figure 16-6, bridge ports must be assigned to real devices (or to virtual devices that are not bridge devices).

  • 桥接端口已分配给设备(即dev->br_port不为 NULL)。

  • The bridge port is already assigned to a device (i.e., dev->br_port is not NULL).

br_add_if 函数

图 16-7。br_add_if 函数

Figure 16-7. br_add_if function

通过这些检查后,将分配新的桥接端口并使用 进行部分初始化new_nbp。特别是,该函数:

When these checks are passed, the new bridge port is allocated and partially initialized with new_nbp. In particular, that function:

  • 为桥接端口分配端口号。请参阅“网桥 ID 和端口 ID ”部分。

  • Assigns a port number to the bridge port. See the section "Bridge IDs and Port IDs."

  • 为端口分配默认优先级。

  • Assigns a default priority to the port.

  • 使用 br_make_port_id 组合端口号和优先级来计算端口 ID。在端口 ID 的 16 位中,10 位(BR_PORT_BITS)用于端口号,6 位用于端口优先级。请注意,这不符合第 15 章“网桥和端口 ID”一节中描述的标准规范。

  • Computes the port ID by combining the port number and priority using br_make_port_id. Out of the 16 bits in the port ID, 10 (BR_PORT_BITS) are used by the port number and 6 by the port priority. Note that this does not conform to the standard specifications described in the section "Bridge and Port IDs" in Chapter 15.

  • 根据从属设备的速度为端口分配默认成本。成本通过 br_initial_port_cost 选择(参见 new_nbp 的调用方式),它通过第 8 章中介绍的 ethtool 接口读取设备速度,并将其转换为成本。当从属设备的设备驱动程序不支持 ethtool 接口时,无法由该接口导出默认成本,因此按照从属设备是 10 Mbit/s 以太网设备的假设来选择成本。端口速度与默认端口成本之间的对应关系在 IEEE 802.1D 协议规范中定义。

  • Assigns a default cost to the port based on the enslaved device's speed. The cost is selected with br_initial_port_cost (see how new_nbp is called), which reads the device speed via the ethtool interface introduced in Chapter 8, and converts it into a cost. When the device driver of the enslaved device does not support the ethtool interface, a default cost cannot be derived from the interface, so the cost is selected based on the assumption that the enslaved device is an Ethernet 10 Mbit/s device. The association between port speed and default port cost is defined in the IEEE 802.1D protocol specification.

  • 指定初始BR_STATE_DISABLED 状态。

  • Assigns the initial BR_STATE_DISABLED state.

  • 将桥端口链接到从设备和桥设备,如图 16-6所示。

  • Links the bridge port to the enslaved device and to the bridge device, as shown in Figure 16-6.
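
Two of the steps above, the port-ID packing and the speed-to-cost mapping, can be sketched as follows. The packing mirrors the 10-bit/6-bit split described for br_make_port_id, and the costs are the IEEE 802.1D defaults that br_initial_port_cost derives from the ethtool-reported speed; the exact kernel expressions may differ by version, so treat this as an illustration:

```c
#include <assert.h>
#include <stdint.h>

#define BR_PORT_BITS 10   /* low 10 bits: port number; high 6: priority */

/* Pack a 6-bit priority and a 10-bit port number into a 16-bit port ID. */
static uint16_t make_port_id(uint16_t priority, uint16_t port_no)
{
    return (uint16_t)((priority << BR_PORT_BITS) |
                      (port_no & ((1 << BR_PORT_BITS) - 1)));
}

/* Default path costs per IEEE 802.1D, indexed by the device speed in
 * Mbit/s. 10 Mbit/s (cost 100) is assumed when the driver does not
 * support ethtool and the speed is unknown. */
static int initial_port_cost(int speed_mbps)
{
    if (speed_mbps >= 10000) return 2;
    if (speed_mbps >= 1000)  return 4;
    if (speed_mbps >= 100)   return 19;
    return 100;
}
```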

与新桥接端口关联的设备的 MAC 地址通过 br_fdb_insert 添加到转发数据库中。

The MAC address of the device associated with the new bridge port is added to the forwarding database with br_fdb_insert.

br_sysfs_addif 将必要的链接添加到 /sys 中,如第 17 章“通过 /sys 文件系统进行调整”一节所述。

br_sysfs_addif adds the necessary links to /sys, as described in the section "Tuning via /sys Filesystem" in Chapter 17.

与桥接端口关联的 NIC 通过 dev_set_promiscuity 被置于混杂模式。混杂模式用于捕获所有 LAN 流量,网桥需要它才能完成转发帧的工作。该模式以计数器而不是每个端口的布尔标志来存储,因为内核希望能够处理进入混杂模式的嵌套请求。当在桥接端口上启用混杂模式时(dev_set_promiscuity 正是这样做的),相关从属设备上的计数器会递增;当混杂模式被禁用时,计数器会递减。

The NIC associated with the bridge port is put into promiscuous mode with dev_set_promiscuity. Promiscuous mode is used for capturing all LAN traffic, and is needed so that the bridge can do its job of forwarding frames. The mode is stored as a counter rather than a Boolean flag for each port because the kernel wants to be able to handle nested requests to enter promiscuous mode. When promiscuous mode is enabled on a bridge port (as dev_set_promiscuity does), the counter is incremented on the associated enslaved device; when promiscuous mode is disabled, the counter is decremented.
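
The nested counting can be sketched as follows. This is an illustrative reduction of the idea behind dev_set_promiscuity, not the kernel routine itself:

```c
#include <assert.h>

/* The device is actually promiscuous whenever the counter is positive;
 * each user takes (+1) and releases (-1) promiscuity independently. */
struct dev_sketch {
    int promiscuity;   /* nesting counter, not a boolean */
};

static void set_promiscuity(struct dev_sketch *dev, int inc)
{
    dev->promiscuity += inc;
    if (dev->promiscuity < 0)
        dev->promiscuity = 0;   /* guard against unbalanced decrements */
}

static int is_promiscuous(const struct dev_sketch *dev)
{
    return dev->promiscuity > 0;
}
```

Two independent users (the bridge and, say, a packet sniffer) can each take and release promiscuity without stepping on one another.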

最后,新的网桥端口被添加到图 16-6 所示的网桥端口列表中,并按照我们在第 15 章“网桥 ID 和端口 ID”一节和“桥接设备设置例程”一节中看到的规则,分别用 br_stp_recalculate_bridge_id 和 dev_set_mtu 更新网桥 ID 和 MTU。

Finally, the new bridge port is added to the bridge's port list shown in Figure 16-6, and the bridge ID and MTU are updated according to the rules we saw in the sections "Bridge IDs and Port IDs" in Chapter 15 and "Bridge Device Setup Routine," with br_stp_recalculate_bridge_id and dev_set_mtu, respectively.

删除桥接端口

Deleting a Bridge Port

删除网桥主要需要撤消在端口创建时所做的操作。图 16-8显示了 的内部结构 br_del_if

Deleting a bridge mainly requires undoing what was done at port creation time. Figure 16-8 shows the internals of br_del_if.

br_del_if 函数

图 16-8。br_del_if 函数

Figure 16-8. br_del_if function

启用和禁用桥接设备

Enabling and Disabling a Bridge Device

我们在“桥接设备设置例程”一节中看到了桥接设备的 dev->open 和 dev->stop 是如何初始化的,并在第 8 章“启用和禁用网络设备”一节中看到了启用和禁用网络设备的管理命令是如何由 dev_open 和 dev_close 处理的。

We saw in the section "Bridge Device Setup Routine" how dev->open and dev->stop are initialized for bridge devices, and we saw in the section "Enabling and Disabling a Network Device" in Chapter 8 how administrative commands to enable and disable a network device are processed by dev_open and dev_close.

br_dev_open通过以下方式启用桥接:

br_dev_open enables a bridge by:

  1. 使用 br_features_recompute 将桥接设备功能初始化为其从属设备所支持功能的最小公共子集

  1. Initializing the bridge device features to the minimal, common subset of the features supported by its enslaved devices with br_features_recompute

  2. 使用 netif_start_queue 启用设备进行传输(请参阅第 11 章中的“启用和禁用传输”一节)

  2. Enabling the device for transmission with netif_start_queue (see the section "Enabling and Disabling Transmissions" in Chapter 11)

  3. 使用 br_stp_enable_bridge 启用桥接设备

  3. Enabling the bridge device with br_stp_enable_bridge

当您启用桥接设备时,之前受其控制的任何端口也将被启用。

When you enable a bridge device, any port that had previously been enslaved to it would also be enabled.

br_dev_stop就是 的镜像,如图16-9br_dev_open所示。

br_dev_stop is just the mirror image of br_dev_open, as shown in Figure 16-9.

(a) 启用桥梁; (b) 禁用桥

图 16-9。(a) 启用桥梁;(b) 禁用桥

Figure 16-9. (a) Enabling a bridge; (b) disabling a bridge

启用和禁用桥接端口

Enabling and Disabling a Bridge Port

桥接端口分别通过 br_stp_enable_port 和 br_stp_disable_port 来启用和禁用。

A bridge port is enabled and disabled with br_stp_enable_port and br_stp_disable_port, respectively.

要启用桥接端口,必须满足以下所有条件:

For a bridge port to be enabled, all of the following conditions must be met:

  • 关联的从属设备在管理上已启动。

  • The associated enslaved device is administratively UP.

  • 关联的从属设备具有运营商状态。Linux 如何检测载波信号状态的变化,请参见第 8 章“链路状态变化检测”部分。

  • The associated enslaved device has the carrier status. See the section "Link State Change Detection" in Chapter 8 for how Linux detects changes in the carrier signal status.

  • 关联的桥接设备在管理上处于 UP。

  • The associated bridge device is administratively UP.
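
The three conditions can be collapsed into a single predicate. This is purely illustrative; the kernel checks them in several different places rather than in one function like this:

```c
#include <assert.h>

/* The three prerequisites for enabling a bridge port, per the text. */
struct port_conditions {
    int dev_admin_up;     /* enslaved device administratively UP */
    int dev_has_carrier;  /* carrier detected on the enslaved device */
    int bridge_admin_up;  /* bridge device administratively UP */
};

static int port_can_be_enabled(const struct port_conditions *c)
{
    return c->dev_admin_up && c->dev_has_carrier && c->bridge_admin_up;
}
```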

请注意,网桥设备上没有运营商状态,因为网桥是虚拟设备,因此没有运营商状态。

Note that there is no carrier status on the bridge device, because bridges are virtual devices and therefore have no carrier status.

当使用用户空间命令创建桥接端口并且满足上述三个条件时,桥接端口立即启用。请参阅“将端口添加到网桥”部分。

When a bridge port is created with a user-space command and the preceding three conditions are met, the bridge port is enabled right away. See the section "Adding Ports to a Bridge."

假设端口创建后无法启用,因为至少不满足三个必需条件之一。以下是最终满足每个条件时启用端口的位置:

Let's suppose that when the port was created, it could not be enabled because at least one of the three required conditions was not met. Here is where the port is enabled when each condition eventually is met:

  • 当关闭的网桥设备被激活时,其所有禁用的端口都会启用。

  • When a bridge device that was shut down is activated, all of its disabled ports are enabled.

  • 当从属设备检测到运营商状态时,会通过通知通知桥接代码NETDEV_CHANGE请参阅“ netdevice 通知链”部分。

  • When an enslaved device detects the carrier status, the bridging code is notified with a NETDEV_CHANGE notification. See the section "netdevice Notification Chain."

  • 当关闭的从属设备被激活时,桥接代码会收到通知NETDEV_UP请参阅“ netdevice 通知链”部分。

  • When an enslaved device that was shut down is activated, the bridging code is notified with a NETDEV_UP notification. See the section "netdevice Notification Chain."

当不再满足本节开头列出的三个条件中的任何一个时,桥接端口将被禁用。

A bridge port is disabled when any of the three conditions listed at the beginning of this section is no longer met.

图 16-10总结了启用和禁用桥接端口的步骤和相关功能。请注意,当禁用桥接端口时,非根桥可以成为根桥。“成为根桥”部分描述了该转换。

Figure 16-10 summarizes the steps and associated functions for enabling and disabling a bridge port. Note that when a bridge port is disabled, a nonroot bridge can become the root bridge. That transition is described in the section "Becoming the root bridge."

(a) 启用端口; (b) 禁用端口

图 16-10。(a) 启用端口;(b) 禁用端口

Figure 16-10. (a) Enabling a port; (b) disabling a port

请注意,当端口启用时,首先会对其进行初始化,然后使用 br_port_state_selection 分配正确的状态。该函数循环遍历所有网桥端口,为每个端口应用正确的状态。但在不运行 STP 的网桥上,该函数实际上最终只是将新端口置于 BR_STATE_FORWARDING 状态。这是因为该端口被分配了指定端口角色(尽管不运行 STP 的网桥不应该关心端口角色)。我们需要记住,大多数例程并不区分 STP 是启用还是禁用。例如,br_port_state_selection 在所有端口上循环,因为当 STP 启用并发生配置更新时,它可能会更改许多端口的角色以及状态(请参阅“配置更新”一节)。

Note that when a port is enabled, it is first initialized and then assigned the right state with br_port_state_selection. This function loops over all bridge ports to apply the right state to each one. But on a bridge that does not run STP, the function actually ends up just putting the new port into the BR_STATE_FORWARDING state. This is because the port is assigned the designated role (although a bridge that does not run the STP should not care about port roles). We need to keep in mind that most routines do not distinguish whether the STP is enabled or disabled. For example, br_port_state_selection loops over all ports because, when the STP is enabled and undergoes a configuration update, it may change the role and therefore the state of many ports (see the section "Configuration Updates").

更改桥接端口的状态

Changing State on a Bridge Port

桥接端口要么处于活动状态,要么处于非活动状态:对应的状态是 BR_STATE_FORWARDING 和 BR_STATE_BLOCKING。然而,BR_STATE_BLOCKING 状态可以立即分配给端口,而 BR_STATE_FORWARDING 状态只有在先经过第 15 章“端口状态”一节中介绍的中间状态之后才能到达。

A bridge port is either active or inactive: the associated states are BR_STATE_FORWARDING or BR_STATE_BLOCKING. However, while the BR_STATE_BLOCKING state can be assigned right away to a port, the BR_STATE_FORWARDING state is reached only after first going through the intermediate states introduced in the section "Port states" in Chapter 15.

BR_STATE_FORWARDING 和 BR_STATE_BLOCKING 状态分别由 br_make_forwarding 和 br_make_blocking 例程分配。无论承载该端口的网桥设备是否运行 STP,使用的都是这两个例程。

The BR_STATE_FORWARDING and BR_STATE_BLOCKING states are assigned with the br_make_forwarding and br_make_blocking routines, respectively. The same two routines are used regardless of whether the bridge device hosting the port is running the STP.

static void br_make_blocking(struct net_bridge_port *p)
{
    if (p->state != BR_STATE_DISABLED &&
        p->state != BR_STATE_BLOCKING) {
        if (p->state == BR_STATE_FORWARDING ||
            p->state == BR_STATE_LEARNING)
            br_topology_change_detection(p->br);

        p->state = BR_STATE_BLOCKING;
        br_log_state(p);
        del_timer(&p->forward_delay_timer);
    }
}

static void br_make_forwarding(struct net_bridge_port *p)
{
    if (p->state == BR_STATE_BLOCKING) {
        if (p->br->stp_enabled) {
            p->state = BR_STATE_LISTENING;
        } else {
            p->state = BR_STATE_LEARNING;
        }
        br_log_state(p);
        mod_timer(&p->forward_delay_timer, jiffies + p->br->forward_delay);
    }
}
static void br_make_blocking(struct net_bridge_port *p)
{
    if (p->state != BR_STATE_DISABLED &&
        p->state != BR_STATE_BLOCKING) {
        if (p->state == BR_STATE_FORWARDING ||
            p->state == BR_STATE_LEARNING)
            br_topology_change_detection(p->br);

        p->state = BR_STATE_BLOCKING;
        br_log_state(p);
        del_timer(&p->forward_delay_timer);
    }
}

static void br_make_forwarding(struct net_bridge_port *p)
{
    if (p->state == BR_STATE_BLOCKING) {
        if (p->br->stp_enabled) {
            p->state = BR_STATE_LISTENING;
        } else {
            p->state = BR_STATE_LEARNING;
        }
        br_log_state(p);
        mod_timer(&p->forward_delay_timer, jiffies + p->br->forward_delay);
    }
}

请注意,您不能为端口分配介于 BR_STATE_BLOCKING 和 BR_STATE_FORWARDING 之间的任何中间状态,这就是为什么当要求把不处于 BR_STATE_BLOCKING 状态的端口改为 BR_STATE_FORWARDING 时,br_make_forwarding 会直接返回的原因。然而,中间状态表示该端口已经在通往 BR_STATE_FORWARDING 状态的途中,并将在相应的计时器到期时到达该状态。

Note that you cannot assign a port any of the intermediate states between BR_STATE_BLOCKING and BR_STATE_FORWARDING, which is why br_make_forwarding returns if asked to change a port that is not in the BR_STATE_BLOCKING state to BR_STATE_FORWARDING. However, an intermediate state indicates that the port is already on its way to the BR_STATE_FORWARDING state and will get there when the proper timer expires.

在 br_make_forwarding 中,当未使用 STP 时,桥接端口会跳过 BR_STATE_LISTENING 状态。当不使用 STP 时,所有桥接端口最终都将进入转发状态;因此,BR_STATE_LEARNING 本来也可以跳过。然而,使用中间状态 BR_STATE_LEARNING 可以让网桥学习一些 MAC 地址,从而减少在转发数据库为空时所需的泛洪量。

In br_make_forwarding, a bridge port skips the BR_STATE_LISTENING state when the STP is not in use. When the STP is not in use, all bridge ports are going to be assigned the forwarding state; therefore, you can skip BR_STATE_LEARNING, too. However, the use of the intermediate state BR_STATE_LEARNING can allow the bridge to learn some MAC addresses and reduce the amount of flooding that would otherwise be needed with an empty forwarding database.

大局观

The Big Picture

图 16-11显示了桥接代码用来处理入口和出口帧(数据帧和 BPDU)的关键例程。

Figure 16-11 shows the key routines that the bridging code uses to process ingress and egress frames (both data frames and BPDUs).

大局观

图 16-11。大局观

Figure 16-11. The big picture

特别要注意的是:

In particular, note that:

  • 这里的钩子与 IP 层一样多(参见第 18 章的图 18-1)。br_handle_frame 中还有一个图中未显示的钩子(NF_BR_BROUTING),它由 ebtables 使用,并在“数据帧与 BPDU”一节中进行了描述。

  • There are as many hooks as there are at the IP layer (see Figure 18-1 in Chapter 18). One more hook in br_handle_frame (NF_BR_BROUTING), not shown in the figure, is used by ebtables and is described in the section "Data Frames Versus BPDUs."

  • 入口数据帧可能会经过netif_receive_skb两次。在第 10 章netif_receive_skb中进行了描述。另请参阅本章后面的“处理数据帧”部分。

  • Ingress data frames may go through netif_receive_skb twice. netif_receive_skb is described in Chapter 10. See also the section "Processing Data Frames," later in this chapter.

  • 当调用表明有必要时,入口帧被传递netif_receive_skb 到桥接子系统handle_bridge,否则传递到上层协议处理程序(如第 13 章所述)。

  • Ingress frames are passed by netif_receive_skb to the bridging subsystem when a call to handle_bridge indicates it is necessary, or to the upper-layer protocol handlers otherwise (as described in Chapter 13).

  • 例如,由于端口被禁用,入口帧可能会被桥接代码丢弃。

  • Ingress frames may be dropped by the bridging code, for example, because the port is disabled.

  • 当接收端口被 STP 阻止时,入口数据帧会被 br_forward 丢弃。当目标地址是主机本地地址时,br_deliver 不需要从任何端口传输出口帧。在这两种情况下,不需要的传输都会被 should_deliver 过滤掉。

  • Ingress data frames are dropped by br_forward when the receiving port has been blocked by the STP. Egress frames do not need to be transmitted out of any port by br_deliver when the destination address is local to the host. In both cases, unneeded transmissions are filtered by should_deliver.

  • 出口数据帧经过dev_queue_xmit 两次。在第 11 章dev_queue_xmit中进行了描述。另请参阅本章后面的“在桥接设备上传输”部分。

  • Egress data frames go through dev_queue_xmit twice. dev_queue_xmit is described in Chapter 11. See also the section "Transmitting on a Bridge Device" later in this chapter.

  • br_flood 函数在网桥的各个端口上泛洪帧。对于入口和出口帧,泛洪都可能是必要的。无论帧在何处生成,当它的目的地址是多播或广播地址,或是转发数据库中不存在的地址时,都必须对其进行泛洪。br_flood 通过它的最后一个输入参数知道自己处理的是入口帧还是出口帧:该参数是它为在每个网桥端口上传输帧而多次调用的函数(入口用 _ _br_forward,出口用 _ _br_deliver)。

  • The br_flood function floods a frame on the ports of a bridge. Flooding may be necessary for both ingress and egress frames. Regardless of where a frame is generated, when it is addressed to a multicast or broadcast address, or to an address not in the forwarding database, it must be flooded. br_flood knows whether it is handling an ingress or egress frame from its final input parameter, which is the function it calls multiple times to transmit the frame on each bridge port (_ _br_forward for ingress and _ _br_deliver for egress).

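The callback-parameter pattern described above can be sketched in plain C. This is a simplified user-space model, not kernel code: the port list, flags, and function names are all illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of br_flood's callback pattern: the same
 * flood routine serves both ingress and egress because the caller passes
 * the per-port transmit function as a parameter. */

struct port {
    int id;
    int forwarding;       /* nonzero if the port may transmit */
    struct port *next;
};

static int deliveries;    /* counts per-port transmissions for the demo */

static void forward_one(struct port *p) { (void)p; deliveries++; }  /* ingress */
static void deliver_one(struct port *p) { (void)p; deliveries++; }  /* egress */

/* Flood a frame: invoke the supplied transmit function on every
 * forwarding port, skipping the port the frame arrived on (may be NULL). */
static void flood(struct port *ports, struct port *ingress,
                  void (*xmit)(struct port *))
{
    for (struct port *p = ports; p; p = p->next)
        if (p != ingress && p->forwarding)
            xmit(p);
}
```

Passing the transmit routine as the final argument is what lets one flooding loop cover both cases, as the text describes for `_ _br_forward` and `_ _br_deliver`.
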
While looking at the code in the next sections, you need to keep in mind that the bridging code uses the same set of core routines regardless of whether the STP is enabled. Some key differences lie in the subtasks executed.

When the STP is enabled:

  • Ingress BPDUs are processed.

  • BPDUs may be generated locally too, depending on the roles of the local bridge ports.

  • Ingress data traffic is either forwarded to the right port or flooded to all bridge ports, according to the rules in Chapter 14.

  • The ports that STP blocks cannot be used to receive and transmit data traffic.

When the STP is disabled:

  • Ingress BPDUs are treated as data traffic.

  • No BPDUs are generated locally.

  • Ingress data traffic is still forwarded to the right port or flooded to all bridge ports, according to the rules in Chapter 14.

  • All the bridge ports (unless they're administratively disabled) can be used to receive and transmit data traffic.

Forwarding Database

Each bridge instance has its own forwarding database, which is used regardless of whether STP is enabled or disabled. We will see later in this chapter exactly when the database is consulted and updated. Let's first look at its implementation and the core functions for manipulating it. All of the routines used to manage forwarding databases are located in net/bridge/br_fdb.c.

The database is embedded in the net_bridge data structure and is defined as a hash table (see Figure 16-6). An instance of a net_bridge_fdb_entry data structure is added to the database for each MAC address that is learned on any of the bridge's ports.

The bridge forwarding database subsystem is initialized with br_fdb_init, which simply creates the br_fdb_cache cache that will be used for the allocation of net_bridge_fdb_entry instances.

Allocations are done with fdb_create, which also initializes a few fields of net_bridge_fdb_entry according to its input parameters.

Lookups

Elements of the forwarding database are identified by their MAC addresses. A lookup in the table consists of selecting the right hash table bucket with br_mac_hash and browsing the bucket's list of net_bridge_fdb_entry instances to find one that matches a given MAC address.

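A minimal user-space sketch of this layout and lookup follows. The hash function here is an illustrative stand-in for `br_mac_hash`, and the structures are simplified models, not the kernel's.

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Sketch of the forwarding database layout: a hash table indexed by MAC
 * address, each bucket holding a linked list of entries. */

#define FDB_HASH_SIZE 256

struct fdb_entry {
    unsigned char mac[6];
    int port_no;                 /* port the address was learned on */
    struct fdb_entry *next;
};

static struct fdb_entry *fdb[FDB_HASH_SIZE];

/* Illustrative hash, not the kernel's br_mac_hash. */
static unsigned mac_hash(const unsigned char *mac)
{
    unsigned h = 0;
    for (int i = 0; i < 6; i++)
        h = h * 31 + mac[i];
    return h % FDB_HASH_SIZE;
}

/* Lookup: pick the bucket, then walk its list comparing MAC addresses. */
static struct fdb_entry *fdb_lookup(const unsigned char *mac)
{
    for (struct fdb_entry *e = fdb[mac_hash(mac)]; e; e = e->next)
        if (!memcmp(e->mac, mac, 6))
            return e;
    return NULL;
}

/* Insert at the head of the bucket's list. */
static void fdb_add(struct fdb_entry *e)
{
    unsigned h = mac_hash(e->mac);
    e->next = fdb[h];
    fdb[h] = e;
}
```
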
There are two main lookup routines:

fdb_find

This simply searches net_bridge_fdb_entry for a given MAC address. It is not used to forward data traffic. It is mainly used by bridging management functions.

_ _br_fdb_get

Similar to fdb_find, this is called by the bridging code to forward traffic. It does not consider expired entries (see the section "Aging").

For both routines, proper locking is ensured by the caller.

An external subsystem that wishes to make a lookup on the forwarding database can use the br_fdb_get routine, a wrapper that takes care of locking and reference counts and calls _ _br_fdb_get. br_fdb_get is not called directly, but via br_fdb_get_hook, which is initialized in br_init to be a pointer to br_fdb_get.

Reference Counts

Because external subsystems that query the forwarding database with br_fdb_get are likely to cache the result, a reference count is used to keep track of when entries in the forwarding database are still needed and when they can be freed. Each entry is assigned a reference count. br_fdb_get always increments the reference count when the lookup succeeds. The caller is supposed to decrement it with br_fdb_put when it no longer needs the reference to the lookup result. When the reference count drops to 0, br_fdb_put frees net_bridge_fdb_entry.

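The get/put contract can be modeled in a few lines. This is a hedged sketch: the `freed` flag and helper names are invented for the demo, and the real kernel code also handles locking.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the br_fdb_get/br_fdb_put contract: a successful lookup takes
 * a reference; the caller drops it when done, and the entry is freed when
 * the count reaches zero. */

struct entry {
    int refcnt;
};

static int freed;   /* demo flag so a test can observe the free */

static struct entry *entry_get(struct entry *e)
{
    if (e)
        e->refcnt++;    /* lookup succeeded: caller now holds a reference */
    return e;
}

static void entry_put(struct entry *e)
{
    if (--e->refcnt == 0) {   /* last reference gone: release the entry */
        freed = 1;
        free(e);
    }
}
```
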
Adding, Updating, and Removing Entries

The forwarding database is populated and updated by a different set of routines, depending on whether the MAC addresses are associated with local interfaces or ingress frames.

When you create a bridge port, br_add_if adds the enslaved device's MAC address to the forwarding database with br_fdb_insert. The latter function ignores MAC addresses that are not supposed to be added to the database, such as multicast and broadcast addresses. When the new address happens to be in the database already, it is replaced unless it is associated with another local interface, in which case there is no need for any update. Note that local MAC addresses in the forwarding database allow the bridging code to deliver ingress frames addressed to a local interface locally. So it does not matter what interface the local MAC address is associated with. All that matters is that at least one entry in the database tells the bridging code what traffic to deliver locally.

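The multicast/broadcast filtering mentioned above hinges on the I/G bit of the first address octet. A small illustrative sketch (the helper names are assumptions, not the kernel's):

```c
#include <assert.h>

/* Sketch of the filtering br_fdb_insert applies to local addresses:
 * multicast MACs (I/G bit set in the first octet) are never added to the
 * forwarding database. The broadcast address ff:ff:ff:ff:ff:ff also has
 * this bit set, so one check covers both cases. */

static int is_multicast_mac(const unsigned char *mac)
{
    return mac[0] & 0x01;
}

static int may_insert_local(const unsigned char *mac)
{
    return !is_multicast_mac(mac);
}
```
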
There is no hard limit on the number of entries that can be added to the forwarding database. This can expose the system to a DOS attack, so we can expect developers to add a hard limit in the near future.

When a local device that is associated with a bridge port—and that therefore has its MAC address in the forwarding database—changes its MAC address,[*] its entry in the database is updated with br_fdb_change_addr (see the section "netdevice Notification Chain"). Because it is possible for multiple local interfaces to be configured with the same MAC address (even though it is not common), br_fdb_change_addr checks whether another bridge port for the same bridge has the same MAC address before removing the net_bridge_fdb_entry instance: if it finds another bridge port with the same MAC address, it binds the database entry to the interface for the remaining port.

The MAC addresses learned with ingress frames (as described in Chapter 14) are added to the database with br_fdb_update. When the address is already in the database, the reference to the ingress port (dst) is updated if needed and the timestamp of the last update (ageing_timer) is updated.

net_bridge_fdb_entry instances are removed with fdb_delete. That function is never called directly, but always through wrappers like br_fdb_cleanup (described in the next section) and br_fdb_delete_by_port.

Aging

For each bridge instance there is a garbage collection timer (gc_timer) that periodically scans the forwarding database and deletes expired entries. The timer is initialized in br_stp_timer_init when the bridge instance is initialized, and is started when the bridge is enabled with br_stp_enable_bridge.

The timer expires every one-tenth of a second and calls br_fdb_cleanup to do the cleanup. That function scans the database and deletes expired entries with fdb_delete.

An entry normally expires if it has not been used for at least 5 minutes. However, when a bridge runs the STP, a shorter aging time of forward_delay seconds is used when a topology change has been detected (see the section "Short Aging Timer" in Chapter 15). The right aging time is used transparently by calling the hold_time routine, which returns the right one to use based on the logic described here.

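The aging decision can be sketched as follows; the structure and field names are illustrative, and the 5-minute default matches the behavior described above.

```c
#include <assert.h>

/* Sketch of the hold_time logic: entries normally age out after 5
 * minutes, but while a topology change is in effect the bridge's
 * forward_delay is used instead. */

#define DEFAULT_AGEING_SECS (5 * 60)

struct bridge_cfg {
    int topology_change;    /* nonzero while a topology change is active */
    int forward_delay;      /* in seconds, configured per bridge */
};

static int hold_time(const struct bridge_cfg *br)
{
    return br->topology_change ? br->forward_delay : DEFAULT_AGEING_SECS;
}

/* An entry expires when it has been idle longer than the hold time. */
static int has_expired(const struct bridge_cfg *br, int idle_secs)
{
    return idle_secs > hold_time(br);
}
```
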
Handling Ingress Traffic

We saw in Chapter 10 how ingress traffic is processed by netif_receive_skb. In particular, we saw how the function calls handle_bridge (defined in net/core/dev.c) before passing each ingress frame to the upper-layer protocol handler.

When the kernel does not have support for bridging, handle_bridge is defined as a NULL pointer and netif_receive_skb hands ingress frames to other protocol handlers. When the kernel does support bridging, and a frame is received on a bridge port, handle_bridge processes the frame with br_handle_frame_hook. The latter pointer is initialized to br_handle_frame when the bridging module is initialized.

#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
...
static _ _inline_ _ int handle_bridge(struct sk_buff **pskb,
                    struct packet_type **pt_prev, int *ret)
{
    struct net_bridge_port *port;

    if ((*pskb)->pkt_type == PACKET_LOOPBACK ||
        (port = rcu_dereference((*pskb)->dev->br_port)) == NULL)
        return 0;

    if (*pt_prev) {
        *ret = deliver_skb(*pskb, *pt_prev);
        *pt_prev = NULL;
    }

    return br_handle_frame_hook(port, pskb);
}
#else
#define handle_bridge(skb, pt_prev, ret)    (0)
#endif

In the following subsections, we will see how handle_bridge processes ingress frames, distinguishing between data frames and STP BPDUs (Figure 16-11, circle a). For data frames, the function also distinguishes between unicast frames and multicast or broadcast frames (Figure 16-11, circle b).

Data Frames Versus BPDUs

On a Linux system with support for bridging, not all NICs need to be configured as bridge ports. When one is configured as a bridge port, the br_port pointer of its net_device points to the associated bridge port. Because each bridge port includes a pointer to the bridge instance it is part of, you can easily get from any real device to the bridge instance it belongs to (if any) and check whether STP is enabled for the device by reading a flag in the net_bridge data structure. See Figure 16-6.

BPDUs generated by the STP are distinguished from all other ingress frames and are processed by the STP receiving routine—but only when the STP is enabled on the bridge containing the ingress port.

Figure 16-12 shows how br_handle_frame hands an ingress frame to the right routine, br_handle_frame_finish or br_stp_handle_bpdu, depending on whether STP is enabled.

Any frame received on a disabled port is dropped.

Data frames are accepted on ports in the BR_STATE_FORWARDING state only, and BPDUs are accepted on any enabled port as long as the STP is enabled (otherwise, they are treated just like common data frames).

The logic followed on the left side of Figure 16-12 to recognize BPDUs follows the rules introduced in the section "BPDU Encapsulation" in Chapter 15.

Note that both routines at the bottom of Figure 16-12 are called only if Netfilter does not drop or consume the frame for other reasons.

ebtables is also given a chance to look at frames. ebtables is a framework that provides extra capabilities that Netfilter does not provide.

Figure 16-12. br_handle_frame function

In particular, ebtables allows filtering and mangling of any frame type, not just those that carry IP packets. For the purposes of our discussion, I need to mention two of ebtables's capabilities:

  • It allows you to define rules to tell the kernel what traffic to bridge and what traffic to route, based on such factors as network protocol (i.e., IPv4) or destination IP address. This means that an NIC that is enslaved to a bridge port doesn't just act as a bridge port, but exists as an independent L3 interface and can be assigned its own L3 configuration. We saw an example in the section "Bridge Device Abstraction."

  • The destination MAC address can be mangled, for example, to redirect the frame to another host or to implement some sort of network address translation. This is why br_handle_frame checks the destination MAC address after ebtables is done.

Support for ebtables can be added to the kernel with the option "Networking support → Networking options → Network packet filtering (replaces ipchains) → Bridge: Netfilter Configuration → Ethernet Bridge tables (ebtables) support". At the ebtables home page, you can find pretty good documentation for this feature, on both the user-space side and the kernel-space side. You can also find clear examples on how to use each feature provided by ebtables. The project's home page is http://ebtables.sourceforge.net.

Processing Data Frames

Ingress data frames are handled by br_handle_frame_finish, shown in Figure 16-13.

First, the source MAC address of the frame is added to the forwarding database with br_fdb_update. Then the destination MAC address is searched for in the forwarding database. If the address is found, the frame is forwarded to the right bridge port with br_forward; otherwise, it is flooded to all forwarding bridge ports with br_flood_frame. Frames addressed to the broadcast or multicast link layer addresses are always flooded.

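The forward-or-flood decision reduces to a small predicate. The sketch below is an illustrative model, not the kernel's br_handle_frame_finish:

```c
#include <assert.h>

/* Illustrative model of the forwarding decision: forward to a single port
 * when the destination is known and unicast, flood otherwise. */

enum action { FORWARD, FLOOD };

static enum action classify(int dst_is_unicast, int dst_port /* -1 = unknown */)
{
    /* Broadcast/multicast destinations, and unicast addresses missing
     * from the forwarding database, are always flooded. */
    if (!dst_is_unicast || dst_port < 0)
        return FLOOD;
    return FORWARD;
}
```
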
A copy of the frame is also delivered locally with br_pass_frame_up (i.e., passed to the upper layer) if any of the following conditions are met:

  • The bridge interface is in promiscuous mode. Note that all devices enslaved to a bridge port are in promiscuous mode because this mode is necessary for the bridge to work. However, the bridge itself is not in promiscuous mode unless you explicitly configure it to be.

  • The frame is flooded for one of the reasons mentioned earlier.

  • According to the forwarding database, the destination MAC address belongs to a local interface.

Figure 16-13. br_handle_frame_finish function

It is interesting to see how the local delivery is handled. Figure 16-14, which is a subset of Figure 16-11, shows the exact path of an ingress frame that ends up passed to the upper-layer protocol handler.

When the frame is received by the NIC's device driver, skb->dev is initialized to the real device. The frame is then pushed up the network stack and eventually passed to br_pass_frame_up. That function crosses through another Netfilter hook and then calls br_pass_frame_up_finish. Here skb->dev is replaced with the bridge device the ingress port is part of and netif_receive_skb is invoked again. This time, handle_bridge sees that the device is not an enslaved device (i.e., br_port is NULL) and hands the frame to the right protocol handler, as described in Chapter 13.

Figure 16-14. Local delivery of ingress data frames

Transmitting on a Bridge Device

We saw in the section "Bridge Device Abstraction" that the bridge device abstraction requires transmissions on a bridge device to be converted into transmissions on one or all bridge ports. Figure 16-11 shows the key routines that make this happen.

The bridge driver's implementation of hard_start_xmit is br_dev_xmit. The latter function simply implements the basic logic used by a bridge to transmit. It copies the frame out of the right bridge port when a lookup in the bridge forwarding database returns success. On the other hand, it floods the frame on all eligible bridge ports when the lookup fails, or when the destination MAC address is either an L2 multicast or L2 broadcast address.

Spanning Tree Protocol (STP)

We saw in Chapter 15 how the STP works. In this chapter, we will mainly see how:

  • Ingress BPDUs are processed

  • Egress BPDUs are transmitted

  • Timers are handled

Key Spanning Tree Routines

Here is a list of the key routines used by the spanning tree code to implement the logic described in Chapter 15:

br_become_root_bridge

br_is_root_bridge

br_become_root_bridge makes a nonroot bridge the root bridge. This task consists of stopping the TCN timer, because it should not run on the root bridge, and starting the Hello timer, which runs only on the root bridge. The function also updates other timers to locally configured values, and starts a topology change. br_is_root_bridge returns 1 when the input bridge is the root bridge, and 0 otherwise.

br_should_become_designated_port

br_designated_port_selection

br_become_designated_port

br_is_designated_port

br_is_designated_for_some_port

br_should_become_designated_port returns 1 if the input port should be assigned the designated role, and 0 otherwise. br_designated_port_selection loops over all the bridge ports and assigns the designated role to those that deserve it (see the section "Designated Port Selection" in Chapter 15).

br_become_designated_port assigns the designated role to a bridge port. br_is_designated_port returns 1 when the input port is a designated port, and 0 otherwise. Given a bridge, br_is_designated_for_some_port returns 1 if the bridge has at least one port with the designated role, and 0 otherwise.

br_supersedes_port_info

Given a bridge port and an input configuration BPDU received on the port, this function returns 1 if the BPDU is superior (i.e., has a better priority vector) than the one known to the bridge port, and 0 otherwise.

br_should_become_root_port

br_root_selection

Given a bridge port and the current root port, br_should_become_root_port compares the priority vector of the first port against the priority vector of the current root port and returns 1 if the first port has a better priority vector (and therefore should be preferred over the current root port). It returns 0 otherwise. Given a bridge, br_root_selection selects the root port as described in the section "Root Port Selection" in Chapter 15.

br_configuration_update

Given a bridge, determines the root port and designated ports and returns that information.

br_port_state_selection

Given a bridge, selects the right port state for each bridge port.

br_topology_change_detection

br_topology_change_acknowledge

br_topology_change_acknowledged

br_topology_change_detection handles the detection of a topology change, distinguishing between a topology change that is detected by a root bridge and a nonroot bridge. br_topology_change_acknowledge acknowledges the reception of a TCN by transmitting a configuration BPDU with the TCA flag set. br_topology_change_acknowledged stops the TCN timer.

br_record_config_information

br_record_config_timeout_values

Given a bridge port and an ingress configuration BPDU, br_record_config_information records the priority vector of the BPDU on the port's net_bridge_port data structure and restarts the message age timer, and br_record_config_timeout_values records the timer configuration that is in the BPDU (see Figure 15-8 in Chapter 15).

br_get_port

Given a bridge device and a port number, returns the associated net_bridge_port structure.

Bridge IDs and Port IDs

We saw in the section "Bridge and Port IDs" in Chapter 15 how bridge IDs and port IDs are defined. While the priority component of both IDs is assigned a default value that can be overridden by the system administrator, the MAC address component of the bridge ID and the port number component of the port ID are initialized by the kernel as follows:

Bridge MAC address

The lowest MAC address among the ones configured on the enslaved devices is selected. The selection is done with br_stp_recalculate_bridge_id anytime a new bridge port is created or deleted, and when an enslaved device changes its MAC address (see the section "netdevice Notification Chain").

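Selecting the lowest MAC address amounts to a memcmp-style minimum over the enslaved devices' addresses. An illustrative sketch (the helper name is an assumption, not the kernel's):

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Sketch of recalculating the bridge address: pick the numerically
 * lowest MAC among the enslaved ports, using memcmp ordering. */

static const unsigned char *lowest_mac(const unsigned char macs[][6], int n)
{
    const unsigned char *best = macs[0];
    for (int i = 1; i < n; i++)
        if (memcmp(macs[i], best, 6) < 0)
            best = macs[i];
    return best;
}
```
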
Port number

The first number in the range 1−BR_MAX_PORTS that is not already in use is selected. The selection is done with find_portno when the bridge port is created (see the section "Adding Ports to a Bridge").

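The first-free-number selection can be sketched like this; the in-use array stands in for the kernel's port bitmap, and the constant is illustrative:

```c
#include <assert.h>

/* Sketch of the find_portno idea: pick the first port number in
 * 1..MAX_PORTS not already in use. */

#define MAX_PORTS 8

static int pick_portno(const int in_use[MAX_PORTS + 1])
{
    for (int n = 1; n <= MAX_PORTS; n++)
        if (!in_use[n])
            return n;
    return -1;   /* no free port number */
}
```
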
Enabling the Spanning Tree Protocol on a Bridge Device

You will see in Chapter 17 how the STP can be turned on and off for each bridge device. The stp_enabled field of the bridge's net_bridge structure indicates whether the STP is enabled on the bridge device.

When the STP is not in use, most of the data structures listed in the section "Important Data Structures" include fields that are not needed, including timers. In addition, no BPDUs are transmitted and no ingress BPDUs are processed.

You would probably expect that when a bridge is created with STP disabled, only the right fields and timers would be initialized, and that when stp_enabled is set to enable STP later, the necessary additional fields and timers would be initialized and started. However, Linux behaves differently.

When a bridge device or port is initialized, all its fields (including those used by STP) are initialized, regardless of whether STP is enabled. The Hello timer , which is used by STP on root bridges to transmit BPDUs, is also started. This way, if STP is enabled later, all data structures will be ready to go.

Every time the Hello timer expires, according to STP, a bridge is supposed to transmit BPDUs out of its designated ports. Because a bridge's timer runs regardless of whether STP is enabled, the transmit routine always checks the value of stp_enabled and returns immediately when the field says STP is disabled. As soon as STP is enabled, by setting stp_enabled, BPDU transmissions start right away. On a system with few bridge devices, to have a timer that expires regularly to do nothing is not a big waste of resources, but should be avoided anyway. On a system with quite a few bridge instances, having the Hello timer run when it is not needed can be a significant waste of CPU time.

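The always-running timer whose handler bails out when STP is disabled can be modeled as below; the structure and field names are illustrative:

```c
#include <assert.h>

/* Minimal model of the behavior described above: the Hello timer's
 * handler always runs, but returns immediately unless STP is enabled
 * on the bridge. */

struct br {
    int stp_enabled;
    int bpdus_sent;
};

static void hello_timer_expired(struct br *b)
{
    if (!b->stp_enabled)
        return;             /* timer fires, but nothing is transmitted */
    b->bpdus_sent++;        /* STP on: transmit BPDUs out designated ports */
}
```
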
Processing Ingress BPDUs

Ingress BPDUs are passed to br_stp_handle_bpdu (Figure 16-15), which updates the forwarding database and hands them to the right routine based on their type, or discards them when any of the following conditions is met:

  • The frame is truncated.

  • Either the bridge device or the bridge port that received the frame is disabled.

  • The STP is disabled on the bridge device. This case is uncommon because br_handle_frame does not hand BPDUs to br_stp_handle_bpdu when STP is disabled on the bridge device (see the section "Data Frames Versus BPDUs").

  • The bridge does not know how to interpret the BPDU message. Because the Linux kernel implements only the IEEE 802.1D STP, it accepts only configuration and TCN BPDUs. Any other BPDU type is discarded.

Figure 16-15. br_stp_handle_bpdu function

br_received_config_bpdu processes configuration BPDUs according to the logic described in the section "Transmitting Configuration BPDUs" in Chapter 15.

br_received_tcn_bpdu processes TCN BPDUs according to the logic described in the section "Letting All Bridges Know About a Topology Change" in Chapter 15.

Note that BPDU processing for any bridge is serialized with the bridge lock: different CPUs cannot process BPDUs concurrently for the same bridge.

Transmitting BPDUs

We saw when configuration and TCN BPDUs are transmitted in the sections "Transmitting Configuration BPDUs" and "Letting All Bridges Know About a Topology Change" in Chapter 15. Here are the transmitting routines:

br_transmit_config

Transmits a configuration BPDU according to the logic in the section "Transmitting Configuration BPDUs" in Chapter 15.

br_transmit_tcn

Transmits a TCN BPDU.

br_reply

Replies to an ingress configuration BPDU with another configuration BPDU. It is a simple wrapper around br_transmit_config.

All BPDU transmissions go through the NF_BR_LOCAL_OUT Netfilter hook, as shown in Figure 16-16.

Figure 16-16. Transmit routines

We saw in the section "BPDU Aging" in Chapter 15 that Configuration BPDUs have a limited lifetime enforced through the embedded Message Age field. Here is how nonroot bridges update that field before relaying the BPDU.

当收到配置 BPDU 时,br_stp_handle_bpdu将 BPDU 的 Message Age 字段保存在本地变量中。当调用它来传输 BPDU 时,它会更新消息年龄字段,添加自收到原始帧br_transmit_config以来经过的时间量。br_stp_handle_bpdu因为消息年龄计时器以1/256秒的倍数表示,但内核更好管理的时间单位是tick,br_stp_handle_bpdu 保存时将消息年龄转换为tick。br_transmit_config稍后以滴答为单位计算经过的时间,但将结果转换回 1/256 秒的单位,以便将其写入 BPDU。转换是通过br_get_tick和进行的br_set_tick

When a configuration BPDU is received, br_stp_handle_bpdu saves the Message Age field of the BPDU in a local variable. When br_transmit_config is called to transmit a BPDU out, it updates the Message Age field, adding the amount of time that passed since br_stp_handle_bpdu received the original frame. Because the Message Age timer is expressed in multiples of 1/256th of a second, but the unit of time the kernel manages better is ticks, br_stp_handle_bpdu converts the message age to ticks when saving it. br_transmit_config later computes the elapsed time in ticks, but converts the result back into units of 1/256th of a second so that it can write it to the BPDU. The conversions are made with br_get_tick and br_set_tick.

Configuration Updates

We saw in Chapter 15 how the system administrator can use configuration parameters to affect the topology to which the STP will converge. We also saw how the selection of role and state for a bridge port depends on the current knowledge of the bridge and the information received with ingress configuration BPDUs, in particular the priority vector component (see Figure 15-8 in Chapter 15). Finally, we saw in the section "Defining the Active Topology" in Chapter 15 the events that may trigger a configuration update on a bridge, and what a configuration update consists of.

The routine that takes care of configuration updates is br_configuration_update. Table 16-1 shows where and when that routine is invoked.

Table 16-1. Routines that trigger a configuration update

br_received_config_bpdu
A BPDU with a better priority vector is received on a bridge port.

br_message_age_timer_expired
The information known to a bridge port has expired. See the section "BPDU Aging" in Chapter 15.

br_stp_disable_port
A bridge port has been disabled.

br_stp_change_bridge_id
The MAC address component of the bridge ID has been changed. See the section "Bridge IDs and Port IDs."

br_stp_set_bridge_priority
The priority component of the bridge ID has been changed.

br_stp_set_path_cost
The port path cost has been changed.

Each call to br_configuration_update is always followed by a call to br_port_state_selection, which takes care of updating the state for each bridge port based on its assigned role. State changes are applied using the routines introduced in the section "Changing State on a Bridge Port."

In the section "Handling Configuration Changes" in Chapter 17, you can find the user commands that lead to the execution of some of the routines in Table 16-1.

Root Bridge Selection

We saw in the section "Root Bridge Selection" in Chapter 15 how the root bridge is selected. When a bridge is first enabled, it believes it is the root bridge. Thereafter, based on the information received with ingress configuration BPDUs and the configuration applied by the system administrator, the root bridge status can change.

Figure 16-17 shows how events that can change the root status of a bridge do the job.

Figure 16-17. (a) Becoming the root bridge; (b) giving up the root bridge role

Becoming the root bridge

The routines that process events that can make a nonroot bridge become the root bridge follow the scheme in Figure 16-17(a): first the root status is saved, and then port roles and states are updated. If this update makes the bridge become the root bridge, the required actions are applied, such as starting and stopping the right timers.

Here are the routines that trigger a configuration update and that may elect a nonroot bridge as the new root bridge:

br_stp_change_bridge_id

br_stp_set_bridge_priority

These are called when the bridge's MAC address and the bridge's priority are changed, respectively. Because these are the two fields of which bridge IDs are composed, and because the election of the root bridge is based on the bridge ID, any change in them may change the root bridge.

br_stp_disable_port

When you disable the only port that a bridge can use to reach the current root bridge, the spanning tree is partitioned, and a new root bridge has to be selected in the partition the current bridge is part of. This is why most of the bridges in the examples in Chapter 15 have redundant links.

br_message_age_timer_expired

When the information a port received from the designated bridge expires (most likely because the designated bridge is not the designated bridge anymore, or because it simply failed), the port is assigned the designated role. Since this is a change in the topology, it is possible that the root bridge changes, too.

Giving up the root bridge role

A root bridge relinquishes its role when it receives a BPDU with a superior priority vector. That condition is detected in br_received_config_bpdu and is handled as described in Figure 16-17(b).

Timers

We saw the per-port and per-bridge timers used by the STP code in the section "Timers" in Chapter 15.

Port and bridge timers are initialized with br_stp_port_timer_init and br_stp_timer_init, respectively.

Tables 16-2 and 16-3 list the routines that are executed when the timers expire. All of these routines, plus the two initialization routines, are defined in net/bridge/br_stp_timer.c. All timer handlers run with the bridge lock held.

Table 16-2. Handlers for the STP bridge's timers

Hello: br_hello_timer_expired
Topology Change Notification: br_tcn_timer_expired
Topology Change: br_topology_change_timer_expired

Table 16-3. Handlers for the STP port's timers

Max Age: br_message_age_timer_expired
Forward Delay: br_forward_delay_timer_expired
Hold: br_hold_timer_expired

Handling Topology Changes

We saw in the section "Topology Changes" in Chapter 15 the events that are considered topology changes. These events are detected by the following routines:

br_make_blocking

Called when the STP has decided to block a forwarding port.

br_forward_delay_timer_expired

Called when a port in the BR_STATE_LEARNING state (i.e., not yet forwarding) is moved to the BR_STATE_FORWARDING state.

br_become_root_bridge

Called when a nonroot bridge becomes the root bridge. See the section "Becoming the root bridge" for when this routine is invoked.

br_received_tcn_bpdu

Called when a TCN BPDU is received on a bridge port. See the section "Processing Ingress BPDUs" for when this routine is invoked.

netdevice Notification Chain

Because the virtual bridge device is defined as an abstraction on top of real (enslaved) devices, the bridge device is likely to be affected when any of its enslaved devices changes status. For this reason, the bridging subsystem's initialization routine, briefly described in the section "Initialization of Bridging Code," registers the br_device_event callback with the netdevice notification chain. The bridging code is interested only in enslaved devices, so any notification regarding a nonenslaved device is of no interest and does not need attention.

Here is how each received event notification is processed:

NETDEV_CHANGEMTU

The MTU of the bridge device is updated to reflect the minimum MTU among those configured on the enslaved devices.

NETDEV_CHANGEADDR

When an enslaved device changes its MAC address, its entry in the forwarding database is updated with br_fdb_changeaddr and the bridge ID is updated with br_stp_recalculate_bridge_id to reflect the rule we saw in the section "Bridge IDs and Port IDs."

NETDEV_CHANGE

This notification can be used for various purposes. The bridging subsystem is interested only in changes to the carrier status.

When an enslaved device loses or detects its carrier, the associated bridge port is disabled or enabled with br_stp_disable_port and br_stp_enable_port, respectively. When the bridge device this device is associated with has been left down by the administrator (i.e., IFF_UP is not set), the notification is ignored.

NETDEV_FEAT_CHANGE

When the features of an enslaved device change, the feature set of the bridge device is updated with br_features_recompute to reflect the set of features common to all of its real devices.

NETDEV_DOWN

When an enslaved device is disabled by the administrator, the associated bridge port must be disabled, too; this is handled by br_stp_disable_port. This is not necessary when the bridge the port is associated with is already down, because that would imply that the bridge port is already down, too.

NETDEV_UP

When an enslaved device is enabled by the administrator (i.e., IFF_UP is set), the associated bridge port is enabled with br_stp_enable_port if the device has carrier and the associated bridge device is up, too.

NETDEV_UNREGISTER

When an enslaved device is unregistered, the associated bridge port is removed with br_del_if.

With the exception of NETDEV_UNREGISTER, all events are processed with the bridge lock held.




[*] Only the system administrator can change the MAC address of an interface, which requires an explicit command like ip link set eth0 address 00:20:ED:76:1E:12 or ifconfig eth0 hw ether 00:20:ED:76:1E:12. Changing the MAC address of an interface is rarely done.

Chapter 17. Bridging: Miscellaneous Topics

In the previous chapters, we saw how bridging and the STP are implemented, and how they fit into the network stack. In this chapter, we conclude the bridging part of the book with a description of how the subsystem interacts with the user-space commands that configure bridging. I will not describe the commands themselves, because administration is outside the scope of this book.

We will also look at the various files exported in the /sys directory that can be used to tune bridging. The chapter concludes with a detailed description of the data structures introduced in Chapter 16.

User-Space Configuration Tools

Bridging can be configured with brctl, a utility you can download at http://bridge.sourceforge.net/. With brctl, you can create bridge devices, enslave NICs to bridge devices, and configure bridge parameters and bridge port parameters for the STP.

brctl uses the ioctl interface to talk to the kernel unless the libsysfs library is installed, in which case the sysfs interface becomes the preferred choice. The libsysfs library, which can be downloaded at http://linux-diag.sourceforge.net/Sysfsutils.html, provides all the necessary primitives to access and modify the content of the variables exported in /sys. See the section "Tuning via /sys Filesystem."

In the section "Data Frames Versus BPDUs" in Chapter 16, we introduced ebtables. The user-space configuration tool can be downloaded at http://ebtables.sourceforge.net. We will not look at it in this chapter; you can find pretty good documentation on its home page.

Handling Configuration Changes

Table 17-1 lists the brctl commands and the callback routines of the kernel bridging code that the configuration layer calls to notify bridging about the changes. For example, when you create the bridge device br0 with a command like brctl addbr br0, the kernel ends up calling br_add_bridge, the routine we described in the section "Creating Bridge Devices and Bridge Ports" in Chapter 16.

Note that some commands do not need to invoke any callback routine. For example, if you change the Hello time with a command such as brctl sethello br0 3, the new value will be visible immediately to the bridging code: there is no need for any action to be taken by the STP.

Table 17-1. brctl commands and associated kernel handlers

addbr: Create a bridge device. Handler: br_add_bridge
delbr: Delete a bridge device. Handler: br_del_bridge
addif: Create a bridge port. Handler: br_add_if
delif: Delete a bridge port. Handler: br_del_if
setageing: Set the aging time for the addresses in the forwarding database. Handler: N/A
setbridgeprio: Set the bridge priority. Handler: br_stp_set_bridge_priority
setfd: Set the Forward Delay timer. Handler: N/A
sethello: Set the Hello timer. Handler: N/A
setmaxage: Set the Max Age timer. Handler: N/A
setpathcost: Set the port path cost. Handler: br_stp_set_path_cost
setportprio: Set the port priority. Handler: br_stp_set_port_priority
show: Show the bridge device. Handler: N/A
showmacs: Show the forwarding database for a bridge. Handler: N/A
showstp: Show the spanning tree information for a bridge. Handler: N/A
stp: Enable or disable the STP on a bridge. Handler: N/A

The routines in Table 17-1 are used regardless of whether brctl talks to the kernel with ioctl commands or via sysfs. Regardless of whether a given command requires the invocation of a bridging callback routine, a kernel routine is always called to take care of the brctl command.

Old Interface Versus New Interface

Because the kernel code supports both the old and new interfaces, it must be able to handle both versions correctly. Unfortunately, this makes the ioctl code that takes care of bridging configuration commands a little messy. The old interface is completely based on ioctl commands, whereas the new one uses ioctl only for a subset of commands and sysfs for the others.

Figures 17-1(a) and 17-1(b) show how ioctl commands for both interfaces are routed to the right routines for processing (I know, it's not really what you call clear and clean code).

Figure 17-1a. Dispatching ioctl commands

The top diamond is the initial dispatching done in sock_ioctl in net/socket.c. Note that the figure shows only the details needed to route bridging commands, even though some of the routines are shared by other features' commands, too. The commands with a lighter color are the ones used by the new interface.

One detail worth mentioning is that br_ioctl_deviceless_stub tries to load the bridge kernel module if it is not already in memory.

The next two sections offer some more details on the two interfaces.

Figure 17-1b. Dispatching ioctl commands

Creating Bridge Devices and Bridge Ports

I would divide brctl commands into two classes: those used to create and delete bridge devices and bridge ports, and those used to configure or dump the configuration of bridge devices and bridge ports (including details on the STP).

Both the old and the new interfaces use ioctl commands to implement the first class of commands. The exact ioctl command codes used by the old and new interfaces are listed in Table 17-2.

Table 17-2. ioctl commands used for creating bridge devices and ports

addbr: old SIOCSIFBR, BRCTL_ADD_BRIDGE; new SIOCBRADDBR
delbr: old SIOCSIFBR, BRCTL_DEL_BRIDGE; new SIOCBRDELBR
addif: old SIOCDEVPRIVATE, BRCTL_ADD_IF; new SIOCBRADDIF
delif: old SIOCDEVPRIVATE, BRCTL_DEL_IF; new SIOCBRDELIF

Note that the old interface needs to pass an argument with the ioctl command to identify the precise brctl command, whereas for the new interface the ioctl command is sufficient.

Configuring Bridge Devices and Ports

The second class of commands is implemented differently in the old and new interfaces: the old interface uses ioctl commands, and the new interface uses sysfs.

The exact ioctl command codes used by the old interface are listed in Table 17-3.

Table 17-3. ioctl commands used by the old interface for configuring bridge devices and ports

setageing: SIOCDEVPRIVATE, BRCTL_SET_AGEING_TIME
setbridgeprio: SIOCDEVPRIVATE, BRCTL_SET_BRIDGE_PRIORITY
setfd: SIOCDEVPRIVATE, BRCTL_SET_FORWARD_DELAY
sethello: SIOCDEVPRIVATE, BRCTL_SET_HELLO_TIME
setmaxage: SIOCDEVPRIVATE, BRCTL_SET_MAX_AGE
setpathcost: SIOCDEVPRIVATE, BRCTL_SET_PATH_COST
setportprio: SIOCDEVPRIVATE, BRCTL_SET_PORT_PRIORITY[a]
show: SIOCDEVPRIVATE, BRCTL_GET_BRIDGE_INFO
showmacs: SIOCDEVPRIVATE, BRCTL_GET_FDB_ENTRIES
showstp: SIOCDEVPRIVATE, BRCTL_GET_PORT_INFO
stp: SIOCDEVPRIVATE, BRCTL_SET_BRIDGE_STP_STATE

[a] Version 1.0.6 of brctl uses BRCTL_SET_PATH_COST rather than BRCTL_SET_PORT_PRIORITY. This is likely to be a cut-and-paste error.

Note that all brctl commands use the SIOCDEVPRIVATE command (even though its use is pretty much deprecated in Linux) and an argument that identifies the exact operation.

The sysfs-based configuration simply identifies the right file in /sys and writes to it using the libsysfs library. The operations in Table 17-2 cannot be implemented via sysfs because there is no file in its hierarchy for them.

Tuning via /proc Filesystem

The generic bridging code does not create any files in the /proc filesystem. The firewall bridging extension, however, creates a few files in /proc/sys/net/bridge/ that can be used to make core routines in net/bridge/br_netfilter.c return without processing the buffer they receive as input. These files are created by br_netfilter_init, which is called by br_init when the bridging code is initialized (see net/bridge/br_netfilter.c).

Tuning via /sys Filesystem

As I said in the section "User-Space Configuration Tools," brctl can configure bridge and STP parameters via the sysfs interface. Before seeing how the kernel processes commands from brctl, let's see how the information in /sys is organized.

The kernel creates a directory in /sys/class/net for each registered network device. This directory is used to export both read-only and read-write parameters that apply to network devices in general. Bridge devices, which are assigned a directory like any other network device, include two special subdirectories in their directory: bridge and brif. The first exports bridge parameters, and the second includes a soft (symbolic) link to the directory of each enslaved device (that is, each bridge port). Figure 17-2 shows an example of a system with two Ethernet devices, eth0 and eth1, where the administrator has created one bridge device, br0, and enslaved both eth0 and eth1 to br0.

The br0 directory includes another bridge-specific file: brforward. This is used to export the bridge forwarding database (in binary format). You can dump it with the brctl showmacs command.

The files in the bridge directories are fields of the net_bridge data structure, and the files in the brport directories are fields of the net_bridge_port data structure.[*]

The bridge device directory (br0 in the previous example) is populated with bridge parameters and directories by br_sysfs_addbr when the bridge device is created (see the section "Creating a New Bridge Device" in Chapter 16). When a device is enslaved, its directory (eth0 and eth1 in the previous example) is populated by br_sysfs_addif. The latter also populates the bridge's brif directory.

Figure 17-2. Example of bridge information exported with sysfs

All files in Figure 17-2 are read-only, with the exception of those shown in a lighter color, which are also writable. The writable ones are, for example, those used by brctl to configure bridge and bridge port parameters via libsysfs.

The kernel code that interacts with the files in the bridge device directories is in net/bridge/br_sysfs_br.c, and the code that interacts with the files in the bridge port devices (i.e., the enslaved devices) is in net/bridge/br_sysfs_if.c.

The code may look complex at first glance, but it is actually pretty simple and well organized. For each file in the bridge and brport directories (each bridge or port attribute) that is created, the code defines the routines to invoke when a read or write request is issued on the file, using an instance of a special macro. Let's skip the details on how those macros are put into a table and used by the br_sysfs_add xxx routines introduced earlier, and see a couple of examples of their use.

static CLASS_DEV_ATTR(max_age, S_IRUGO | S_IWUSR, show_max_age, store_max_age)

This declaration in net/bridge/br_sysfs_br.c uses the CLASS_DEV_ATTR macro to define the max_age file with read-write permissions (write permission for the superuser only). When you read the file, the kernel uses show_max_age to return its contents, and when you write to the file, the kernel carries out the change with store_max_age.

static BRPORT_ATTR(port_no, S_IRUGO, show_port_no, NULL)

This declaration in net/bridge/br_sysfs_if.c defines the port_no file, with read-only permission. When you read the file, the kernel uses show_port_no to return its contents. Since the port_no file is read-only, NULL is specified in place of a write routine.

Statistics

The net_bridge data structure includes an instance of a net_device_stats data structure. Each network device employs one net_device_stats structure, as described in the section "Statistics" in Chapter 12. The bridging code uses only a few fields:

tx_packets

tx_bytes

tx_packets is the number of frames generated locally and transmitted over the bridge device. It is updated by br_dev_xmit. Note that flooded frames are counted only once, even though they exit all enabled ports. tx_bytes, the sum of the sizes of the tx_packets frames sent, is also updated by br_dev_xmit.

tx_dropped

Number of frames that could not be transmitted because the flood routine br_flood failed to allocate a buffer.

rx_packets

rx_bytes

rx_packets is incremented by br_pass_frame_up each time an ingress frame received on the bridge device is delivered locally. rx_bytes is the counterpart of tx_bytes.

All of the routines referenced here are described in Chapter 16.

No statistics are kept by the STP.

Data Structures Featured in This Part of the Book

The section "Important Data Structures" in Chapter 16 provided a brief overview of the data structures used by the bridging code. This section provides a field-by-field description of them. The trivial ones, such as mac_addr and br_config_bpdu, do not need dedicated sections.

bridge_id Structure

We saw in the section "Bridge and Port IDs" in Chapter 15 that bridge IDs have two components, the priority and the address:

unsigned char prio[2]

Bridge priority

unsigned char addr[6]

Bridge MAC address

Note that the data structure definition does not reflect the changes introduced by 802.1t.

net_bridge_fdb_entry Structure

These are the fields that are used to define each entry in the forwarding database:

struct hlist_node list

Pointer used to link the data structure into the bucket's list of colliding elements.

struct net_bridge_port *dst

Bridge port.

struct rcu_head rcu

Used when removing the data structure using the read-copy-update (RCU) scheme (see br_fdb_put in net/bridge/br_fdb.c).

atomic_t use_count

Reference count. See the section "Lookups" in Chapter 30.

unsigned long ageing_timer

Aging timer. Different parts of the kernel spell this as aging or ageing. See the section "Aging" in Chapter 16.

mac_addr addr

MAC address. This is the key field used by the lookup routines.

unsigned char is_local

When this flag is 1, the MAC address addr is configured on a local device.

unsigned char is_static

When this flag is 1, the MAC address addr is static and does not expire. All local addresses (i.e., those where is_local is 1) are static, too.

net_bridge_port Structure

This first block of fields is used regardless of whether the STP is used:

struct net_bridge *br

struct net_device *dev

br is the bridge device, and dev is the enslaved device. See Figure 16-6 in Chapter 16.

struct list_head list

Pointer used to link the data structure into the bucket's list of colliding elements.

u8 state

Port state. Valid values are listed in include/linux/if_bridge.h with the BR_STATE_XXX enumeration list.

struct kobject kobj

Used by the generic device infrastructure. This field plays a central role in making possible all that we saw in the section "Tuning via /sys Filesystem."

struct rcu_head rcu

Used to safely destroy the structure using the RCU scheme (see del_nbp in net/bridge/br_if.c).

This second block is used only when the STP is enabled:

u8 priority

Port priority.

u16 port_no

Port number.

port_id port_id

Port ID. This is computed with br_make_port_id as a combination of priority and port_no.

unsigned char topology_change_ack

When this flag is set, the TCA flag must be set on configuration BPDUs transmitted on the port.

unsigned char config_pending

This flag is 1 when a configuration BPDU is waiting to be transmitted because it was previously held back by the Hold timer.

port_id designated_port

bridge_id designated_root

bridge_id designated_bridge

u32 designated_cost

The four components of the priority vector from the most recent configuration BPDU received on the port (see Figure 16-8 in Chapter 16). They are updated upon reception of each configuration BPDU with br_record_config_information.

u32 path_cost

Port path cost.

struct timer_list forward_delay_timer

struct timer_list hold_timer

struct timer_list message_age_timer

Port timers. See the section "Timers" in Chapter 15.

net_bridge Structure

This first block of fields is used regardless of whether the STP is in use:

spinlock_t lock

Lock used to serialize changes to the net_bridge structure or to one of its ports in port_list. Read-only accesses use the rcu_read_lock and rcu_read_unlock primitives.

struct list_head port_list

List of bridge ports.

struct net_device *dev

Bridge device (see Figure 16-6 in Chapter 16).

struct net_device_stats statistics

Statistics. See the section "Statistics."

spinlock_t hash_lock

struct hlist_head hash[BR_HASH_SIZE]

hash is the forwarding database. hash_lock is the lock used to serialize read-write accesses to its entries. Read-only accesses use the rcu_read_lock and rcu_read_unlock primitives.

struct list_head age_list

Not used. This list used to be employed to link together all the entries of the forwarding database in ascending order of most recent use (see Figure 16-6 in Chapter 16). It was used by the aging algorithm to scan the database for expired entries.

unsigned long ageing_time

Maximum time an entry can stay in the forwarding database without being used. See the section "Aging" in Chapter 16.

struct kobject ifobj

Used by the generic device infrastructure. This field plays a central role in making possible all that we saw in the section "Tuning via /sys Filesystem."

unsigned char stp_enabled

When this flag is set, the STP is enabled for the bridge.

The next block of fields is used only when the STP is in use. The only exception is forward_delay, which is used regardless. Bridge ports are not assigned the forwarding state as soon as STP is enabled; they use the forward_delay timer to go through the intermediate states.

bridge_id designated_root

Root bridge's ID.

bridge_id bridge_id

Bridge ID.

u32 root_path_cost

Cost of the best path to the root bridge.

unsigned long max_age

unsigned long hello_time

unsigned long forward_delay

Bridge timers. These values are configured on the root bridge and are saved locally by br_record_config_timeout_values upon reception of configuration BPDUs on the root port.

unsigned long bridge_max_age

unsigned long bridge_hello_time

unsigned long bridge_forward_delay

Bridge timers configured locally. These are used only by the root bridge.

u16 root_port

Port number of the root port.

unsigned char topology_change

This flag is set when the latest configuration BPDU received on the root port had the TC flag set. When topology_change is set, the TC flag must be set on any configuration BPDU transmitted by the bridge. See the section "Example of a Topology Change" in Chapter 15.

unsigned char topology_change_detected

This flag is set when a topology change has been detected. See the section "Topology Changes" in Chapter 15 for the conditions that are considered possible topology changes.

struct timer_list hello_timer

struct timer_list tcn_timer

struct timer_list topology_change_timer

Bridge timers. See the section "Timers" in Chapter 15.

struct timer_list gc_timer

Garbage collection timer for the forwarding database. See the section "Aging" in Chapter 16.

Functions and Variables Featured in This Part of the Book

Table 17-4 summarizes the main functions, variables, and data structures introduced in Part IV. In the sections "Key Spanning Tree Routines" and "Timers" in Chapter 16, you can find some more.

Table 17-4. Functions, variables, and data structures introduced in Part IV

Name

Description

Functions

br_init br_deinit

Initialize and clean up the kernel bridging module. See the section "Initialization of Bridging Code" in Chapter 16.

br_fdb_init

Initialize the forwarding database.

br_netfilter_init

Initialize the Netfilter hooks used by the bridging code.

br_stp_timer_init br_stp_port_timer_init

Initialize the bridge and bridge port timers.

br_sysfs_addbr br_sysfs_delbr

Handle the extra files in sysfs for bridge devices. See the section "Tuning via /sys Filesystem."

br_sysfs_addif br_sysfs_removeif

Handle the extra files in sysfs for bridge ports. See the section "Tuning via /sys Filesystem."

br_add_bridge br_del_bridge

Create and delete a bridge device. See the section "Creating Bridge Devices and Bridge Ports" in Chapter 16.

br_add_if br_del_if

Create and delete a bridge port. See the section "Creating Bridge Devices and Bridge Ports" in Chapter 16.

br_stp_recalculate_bridge_id

Given a bridge, select the numerically lowest MAC address among the ones configured on the bridge ports (i.e., enslaved devices) and use it to compute the bridge ID.

br_min_mtu

Given a bridge, find the lowest MTU among the ones configured on the bridge ports.

br_stp_enable_bridge br_stp_disable_bridge

Enable and disable a bridge device. See the section "Enabling and disabling a bridge instance" in Chapter 16.

br_stp_enable_port br_stp_disable_port

Enable and disable a bridge port. See the section "Enabling and Disabling a Bridge Port" in Chapter 16.

__br_fdb_get br_fdb_get

Look up an entry in the forwarding database. See the section "Lookups" in Chapter 30.

fdb_create br_fdb_insert br_fdb_change_addr br_fdb_update br_fdb_cleanup

Various routines to manipulate the forwarding database. See the section "Forwarding database" in Chapter 16 and its subsections.

handle_bridge br_handle_frame br_handle_frame_finish br_stp_handle_bpdu br_forward br_flood br_pass_frame_up br_pass_frame_up_finish

Various routines used to handle ingress frames. See the section "Handling Ingress Traffic" in Chapter 16.

br_received_config_bpdu br_received_tcn_bpdu

Process an ingress configuration and TCN BPDU, respectively. See the section "Processing Ingress BPDUs" in Chapter 16.

br_transmit_config br_transmit_tcn br_reply br_send_bpdu

Various transmission routines. See the section "Transmitting BPDUs" in Chapter 16.

br_make_blocking br_make_forwarding

br_make_blocking blocks a port, and br_make_forwarding assigns the forwarding state to a port, allowing it to receive and transmit data traffic.

br_get_tick br_set_tick

Read and write a time interval, taking care of the conversion between 1/256th of a second (used in the configuration BPDUs) and ticks (used by Linux).

Variables

BR_MAX_PORTS

Maximum number of bridge ports that can be added to a bridge device.

br_handle_frame_hook

Function pointer initialized to the routine used in the bridging subsystem to process ingress frames. See Figure 16-11 in Chapter 16.

br_fdb_cache

Cache used for the allocation of elements of the forwarding databases.

Data structures

struct mac_addr struct bridge_id struct net_bridge_fdb_entry struct net_bridge_port struct net_bridge struct br_config_bpdu

Main data structures used by the bridging code. See the section "Important data structures" in Chapter 16.

Files and Directories Featured in This Part of the Book

Figure 17-3 lists the files and directories referred to in the chapters in Part IV.

Figure 17-3. Files and directories featured in this part of the book

[*] change_ack is a shortcut for topology_change_ack.

Part V. Internet Protocol Version 4 (IPv4)

The Linux kernel supports many Layer three (L3) protocols, such as AppleTalk, DECnet, and IPX, but this book talks just about the one that dominates modern networking: IP. While IPv4 will be described in detail, IPv6 will be only briefly mentioned as needed. I will not spend much time on the theory behind these protocols, with which you should be somewhat familiar, but I will describe the implementation in Linux. I will focus on aspects of the design that are not obvious or that differ substantially from other operating systems. I will also explain the main drawbacks of version 4 of the IP protocol and show how IPv6 tries to address them. Therefore, while there is both some background theory and some code, I expect the reader to be familiar with the basic IP protocol behavior. Here is what is covered in each chapter:

Chapter 18, Internet Protocol Version 4 (IPv4): Concepts

Introduces the major tasks of the IP layer, and the strategies used.

Chapter 19, Internet Protocol Version 4 (IPv4): Linux Foundations and Features

Shows how the IP-layer reception routine processes ingress packets, and how IP options are taken care of.

Chapter 20, Internet Protocol Version 4 (IPv4): Forwarding and Local Delivery

Shows how ingress IP packets are delivered locally to the L4 protocol handler, or are forwarded when the destination IP address does not belong to the local host but the host has enabled forwarding.

Chapter 21, Internet Protocol Version 4 (IPv4): Transmission

Shows how L4 protocols interface to the IP layer to request transmission.

Chapter 22, Internet Protocol Version 4 (IPv4): Handling Fragmentation

Shows how fragmentation and defragmentation are handled.

Chapter 23, Internet Protocol Version 4 (IPv4): Miscellaneous Topics

Shows how configuration tools such as those in the IPROUTE2 package interface to the kernel, shows how the IP header's ID field is initialized on egress packets, and provides a detailed description of the data structures used at the IP layer.

Chapter 24, Layer Four Protocol and Raw IP Handling

Shows how L4 protocols register a handler for ingress traffic.

Chapter 25, Internet Control Message Protocol (ICMPv4)

Describes the implementation of the ICMP protocol.

Chapter 18. Internet Protocol Version 4 (IPv4): Concepts

This chapter explains what the IP protocol is responsible for, and provides a discussion of the IP header fields that support these activities and the impact of these responsibilities on possible implementations. While the chapter discusses some of the choices made in Linux, implementation details are covered in subsequent chapters.

It would be interesting to show how the protocols of the IPsec security suite have been integrated with the IP protocol, but I could not include this topic for lack of space. However, we will sometimes see how the presence of IPsec transformations influences the implementation of core routines.

IP Protocol: The Big Picture

Figure 18-1 shows the important relationships among the components of Linux that handle IPv4. The flow of traffic between major functions is represented by arrows. We will analyze all of these functions in the next few chapters. The figure shows the placement of two subsystems described elsewhere—the Neighboring subsystem and the Traffic Control subsystem—as well as the many hooks where the Netfilter firewalling system can be invoked.[*]

Figure 18-1 is a useful reference when you're examining networking code and wondering whether a particular function is used for input or output, whether it is called during forwarding, and who calls it.

Since the IP layer does not interact directly with the Traffic Control subsystem, that subsystem is left to Part VI. However, in the section "Interface to the Neighboring Subsystem" in Chapter 21, we will see how IP and the Neighboring subsystem interact.

Figure 18-1. Core functions of the IP kernel stack

Among the tasks of the IP protocol are:

Sanity checks

IP datagrams could be discarded immediately upon entering the system, because of an incorrect checksum (that is, transmission has corrupted it), a header field out of range, or other reasons.

Firewalling

As shown in Figure 18-1, the Netfilter firewall subsystem (controlled on the user side by the iptables command) can be invoked at many points in the packet's history and can change its destiny. As we will see in Part V, Netfilter can be used at L2 as well.

Handling options

The IP protocol includes a few options that applications can use. Even though the original IP RFC (791) says the implementation of options is mandatory for both hosts and routers, not all of them are actually implemented. Some are universally recognized as obsolete, and others are used only in special cases.

Fragmentation/defragmentation

The len field of the IP header is long enough to allow datagrams up to 64 KB in size, but they almost never reach that limit. In fact, MTU values vary from one part of the network to another depending on the media used for transmission,[*] so it is quite possible that a packet will be too big for one of the hops along the way. In such cases, the packet has to be split into smaller pieces to be successfully transmitted. Each fragment can be further fragmented before arriving at the destination, which must reassemble the fragments. The use of fragmentation is discouraged nowadays because it introduces problems. We will see them in the section "Packet Fragmentation/Defragmentation."

Receive, transmit, and forward operations

Input packets are handled by reception functions, and output packets by transmission functions. Forwarding is related to transmission, but deals with packets received from other hosts instead of packets generated by higher network layers on the local system.

I briefly introduce the Raw IP protocol in Chapter 24 and IP-over-IP (also called IP tunneling) in Chapter 23.

IP Header

Readers might be familiar with the basic fields of the IP header, but there are a few parameters that are not well known and some others that have changed in meaning over time. Figure 18-2 shows the header, and the text that follows summarizes their purposes:

Version

Version of the protocol. Currently only versions 4 and 6 are implemented. Version 4 is described in this chapter. Version 6 is not covered in this book, although we will sometimes mention how IPv6 differs from IPv4 when this is useful in context.

Figure 18-2. IP header

Header Length (IHL)

Length of the header, expressed in units of 32 bits.

Type of Service (TOS)

This 8-bit field is composed of three subfields. I will not go into detail about them because their use is very limited, for many reasons. Originally this field was meant to facilitate Quality of Service (QoS) features by telling routers which criteria were considered most important by the packet's sender: minimum delay, maximum throughput, and so on. The TOS field can still be used in this way, but Internet researchers found it too vague and have decided to implement QoS differently. Therefore, they introduced the Differentiated Services[*] (diffserv) model, changing the structure and meaning of the field. The new meaning associated with the diffserv model is shown in Figure 18-3(b). DSCP stands for DiffServ Code Point. Each possible value has a unique and specific meaning for how the packet should be treated. The two formerly unused bits of the TOS field are now used by the Explicit Congestion Notification (ECN) feature, as shown in Figure 18-3(c). Most of the code used to read and manipulate the ECN flags in the IP and TCP headers is located in include/net/inet_ecn.h and include/net/tcp_ecn.h. Refer to the RFCs in Figure 18-3 for more detail.

Total Length

Length of the packet, including the header, expressed in bytes.

Identification

Identifier of the packet. As we will see later in this chapter, this field plays a central role in the handling of fragments.

Figure 18-3. Old and new meanings of the TOS field of the IP header

DF (Don't Fragment)

MF (More Fragments)

Fragment Offset

These three fields, together with Identification, are used by the fragmentation/defragmentation feature of the IP protocol. See the section "Packet Fragmentation/Defragmentation."

Time To Live (TTL)

This field is supposed to represent the number of seconds since the IP packet was transmitted, after which it is to be discarded if it has not reached the final destination. However, because routers decrement it by one, regardless of the time they take to forward it, it actually represents a simple hop count. Each router is supposed to decrement this field when it forwards a packet, and the packet is supposed to be dropped when the TTL reaches zero. Its initial value (set by the sender) in theory depends on the type of payload carried. The more sensitive the payload is to end-to-end delay, the smaller the TTL value should be. Most of the time, however, a default value of 64 is used (see include/linux/ip.h).[*] Packets are not dropped silently: the source is warned through an Internet Control Message Protocol (ICMP) message.

Protocol

This field represents the protocol identifier of the higher layer (L4). The file /etc/protocols [] contains a partial list. You can find more details at http://www.iana.org/numbers.html. In Chapter 24, we will see how the IP layer uses it to hand the ingress packets to the right protocol handler.

Header Checksum

Ensures that the IP header is accurate after transit. Does not cover the packet's payload; it is up to the L4 protocol to take care of checking the content, if necessary.

Source Address

Destination Address

Source (sender) and destination (receiver) IP addresses.

Options

Contains the optional information discussed in the following section. This field can be empty or up to 40 bytes long. Its size is the header length minus 20 (20 being the size of an IP header without options). The maximum value is 40 because the header length is a 4-bit value that represents the header size in units of 32 bits (4 bytes). The highest value that can be represented in 4 bits is 15, and 15 times 4 bytes is 60 bytes. Since 20 bytes are taken up by the basic IP header, only 40 are left for the options.

IP Options

As described earlier in this chapter, network stacks are required to implement a number of IP options that applications can use if they choose to. To accommodate information related to options, the basic 20-byte IP header can be extended by up to another 40 bytes.

Most IP options are used very rarely, and in particular contexts. Different options can be combined into the same IP packet. However, with the exception of the "End of Option List" and "No Operation" options, there can be at most one instance of each option in a header. The presence of options also influences the fragmentation/defragmentation process, as we will see in the section "Packet Fragmentation/Defragmentation."

Some options are very simple and can be specified by a single byte; more complex options require a more flexible format and are called multibyte options.

Figure 18-4 shows the format of both kinds of options. Note that the option data in a multibyte option does not start at a 32-bit boundary.

Figure 18-4. (a) Single IP option format; (b) multibyte IP option format

Each option has an 8-bit field named type that can be further decomposed into three subfields, shown in Figure 18-5. The most common values for type are listed in Table 18-1.[*] It shows the symbols used for options by the Linux kernel and how the value of the symbol breaks down into the three fields in Figure 18-5.

Figure 18-5. Format of the type field of the IP options

When copied is set, the IP layer must copy the option into each fragment when the packet needs fragmentation. class classifies the option according to four criteria; these can be used to filter packets based on IP options, or to apply different QoS parameters to these packets.

Table 18-1. Values of the subcodes of the IP option type field

The Class column uses the encoding Control (00), Reserved (01), Measurement (10), Reserved (11).

Option                           Symbol used in kernel source code   Number   Copied   Class
End of Options List              IPOPT_END                           0        0        Control
No Operation                     IPOPT_NOOP                          1        0        Control
Security                         IPOPT_SEC                           2        1        Control
Loose Source and Record Route    IPOPT_LSRR                          3        1        Control
Timestamp                        IPOPT_TIMESTAMP                     4        0        Measurement
Record Route                     IPOPT_RR                            7        0        Control
Stream ID                        IPOPT_SID                           8        1        Control
Strict Source and Record Route   IPOPT_SSRR                          9        1        Control
Router Alert                     IPOPT_RA                            20       1        Control

include/linux/ip.h中,您可以找到选项类型的定义,以及一些可用于访问其子字段的宏。例如,以下三个宏可分别用于提取numbercopiedclass部分:IPOPT_NUMBERIPOTP_COPIED[ ]IPOPT_CLASS

In include/linux/ip.h, you can find the definitions of the option types, plus some macros that can be used to access their subfields. For instance, the following three macros can be used to extract the number, copied, and class portions, respectively: IPOPT_NUMBER, IPOTP_COPIED,[] and IPOPT_CLASS.

图 18-4(b)中显示的多字节选项使用的附加字段是:

The additional fields shown in Figure 18-4(b), used by multibyte options, are:

长度
Length

选项的长度(以八位字节为单位),包括typelength

Length of the option in octects, including type and length.

指针
Pointer

从选项开头开始测量的偏移量,并在主机处理选项的过程中以各种方式使用。您将在接下来的部分中看到一些示例。编号从 1 开始,而不是从 0 开始(即 1 标识字段的位置 type)。

An offset measured from the beginning of the option and used in various ways as hosts process the option along the way. You will see some examples in upcoming sections. The numbering starts from 1, not 0 (i.e., 1 identifies the location of the type field).

选项数据
Option data

用于处理该选项的中间主机必须存储的任何数据的空间。稍后您将看到一些示例。

Space for any data that must be stored by intermediate hosts that process the option. You will see some examples later.

In the next subsections, we will see how the options in Table 18-1 that are handled by Linux work.

"End of Option List" and "No Operation" Options

The size of the IP header without options is 20 bytes. When the size of the IP options is not a multiple of 4 bytes, the sender pads the IP header with the IPOPT_END option to align it to a 4-byte boundary. This is necessary because the Header Length field of the IP header is expressed in multiples of 4 bytes.

The IPOPT_NOOP option can be used for padding between options, for example, to align the subsequent IP option to a given boundary. In Chapter 19, we will see that Linux uses it also as a convenient way to delete options from an IP header.

Source Route Option

Source routing allows a sender to specify the path that a packet takes to its recipient. A type of source routing is available at both L2 and L3; I'll discuss the L3 implementation here.

Source Routing is a multibyte option in which the source node lists IP addresses to be used on subsequent hops. Of course, if one of the routers in the list goes down, the source-routed packet will not be able to benefit from any dynamic rerouting done on routing protocols. Usually, when a router goes down, the higher-level protocols compute a new source route and resend the packet. Occasionally, they are not allowed to specify a new route, perhaps for security reasons.

Source routing can be of two types: strict and loose. In strict source routing, the sender has to list the IP addresses of every router along the path, and no changes can be made along the way. In loose source routing, one of the intermediate routers can use another router, not specified in the list, as a way to get to the next router in the list. However, all of the routers specified by the sender must still be used in the order specified.

For instance, consider the networks and routers in Figure 18-6. Suppose Host X wants to send a packet to Host Y using the Strict Source Routing option. In this case, Host X must specify all the intermediate router addresses. An example of a strict source route would be R_1 R_2 R_3 Host Y. With loose source routing, something such as R_1 R_3 would be sufficient. The use of nonadjacent routers (i.e., R_1 and R_3 in this example) is allowed and comes with advantages: if R_2 fails, R_2b can be used instead, and vice versa.

Figure 18-6. Example of IP source routing

Record Route Option

The purpose of this option is to ask the routers along the way between source and destination to store the IP addresses of the outgoing interfaces they use to forward the packet. Because of limited space in the header, only nine addresses at most can be stored (and even fewer, if the header contains other options). Therefore, the packet arrives with the first nine[*] addresses stored in the option; the receiver has no way of knowing what routers were used after that. Since this option makes the header (and therefore the IP packet) grow along the way, and since other options may be present in the header, the sender is supposed to reserve the space that will be used to store the addresses. If the reserved space becomes full before the packet gets to its destination, the additional addresses are not added to the list even if the maximum size of an IP header would permit it. No errors (ICMP messages) are generated when there is no room to store a new address. For obvious reasons, the sender is supposed to reserve an amount of space that is a multiple of 4 bytes (the size of an IP address).[*]

Figure 18-7 shows how the IP header portion dedicated to the option changes hop by hop. As each router fills its address, it also updates the pointer field to indicate the end of the data in the option. The offsets at the bottom of the figure start from 1 so that you can compare them to the value of the pointer field.

Figure 18-7. Example of Record Route option

Timestamp Option

This option is the most complicated one because it contains suboptions and, unlike the Record Route option, it handles overflows. To manage those two additional concepts, it needs an additional byte in its header, as shown in Figure 18-8.

Figure 18-8. IP Timestamp option header

The first three bytes have the same meaning as in the other options: type, length, and pointer. The fourth byte is actually split into two fields of four bits each. The rightmost four bits (the least significant ones) represent a subcommand code that can change the effect of the option. Its possible values are:

RECORD TIMESTAMPS

Each router records the time at which it received the packet.

RECORD ADDRESSES AND TIMESTAMPS

Similar to the previous subcommand, but the IP address of the receiving interface is saved, too.

RECORD TIMESTAMPS ONLY AT THE PRESPECIFIED SYSTEMS

Each router records the time at which it received the packet (as with RECORD TIMESTAMPS), but only at specific IP addresses selected by the sender.

In all three cases, the time is expressed in milliseconds (in a 32-bit variable) since midnight UTC of the current day.[*]

The other four bits represent what is called the overflow field. Because the TIMESTAMP option is used to record information along the route, and because the space available in the IP header for that purpose is limited to 40 bytes, there can be cases where a router is unable to record information for lack of space. While the Record Route option processing simply ignores that case, leaving the receiver ignorant of how many times it happened, the TIMESTAMP option increments the overflow field every time it happens. Unfortunately, overflow is a 4-bit field and therefore can have a maximum value of 15: in modern networks, it itself may easily overflow. When that happens, the router that experiences the overflow has to return an ICMP parameter error message back to the original sender.

While the first two suboptions are similar (they differ only in what to save on each hop), the third suboption is slightly different and deserves a few more words. The packet's original sender lists the IP addresses in which it is interested, following each with four bytes of space. At each hop, the option's pointer field indicates the offset of the next 4-byte space. Each router that appears in the address list fills in the appropriate space with a timestamp and updates the pointer field. See Figure 18-9. The underlined hosts in the sequence at the top of the figure are the hosts that add the timestamps. The offsets at the bottom of the figure start from 1 so that you can compare them to the value of the pointer field.

Router Alert Option

This option was added to the IP protocol definition in 1995 and is described in RFC 2113. It marks packets that require special handling beyond simply looking at the destination address and forwarding the packet. For instance, the Resource Reservation Protocol (RSVP), which attempts to create better QoS for a stream of packets, uses this option to tell routers that it must treat the packets in that stream in a special way. Right now, the last two bytes have only one assigned value, zero. This simply means that the router should examine the packet. Packets carrying other values are illegal and should be discarded, generating an ICMP error message to the source that generated them.

Figure 18-9. Example of storing the Timestamp option for pre-specified systems

Packet Fragmentation/Defragmentation

Packet fragmentation and defragmentation is one of the main jobs of the IP protocol. The IP protocol defines the maximum size of a packet as 64 KB, which comes from the fact that the len field of the header, which represents the size of the packet in bytes, is a 16-bit value. However, not many interface types can send packets of a size up to 64 KB. This means that when the IP layer needs to transmit a packet whose size is bigger than the MTU of the egress interface, it needs to split the packet into smaller pieces. We will see later in this chapter that the MTU used is not necessarily the one associated with the egress device; it could be, for instance, the one associated with the routing table entry used to route the packet. The latter depends on several factors, one of which is the egress device's MTU.

Regardless of how the MTU is computed, the fragmentation process creates a series of equal-size fragments, as shown in Figure 18-10. The MF and OFFSET fields shown in the picture are described later in this section. If the MTU does not divide the original size of the packet exactly, the final fragment is smaller than the others.

Figure 18-10. IP packet fragmentation

A fragmented IP packet is normally defragmented by the destination host, but intermediate devices that need to look at the entire IP packet may have to defragment it, too. Two examples of such devices are firewalls and Network Address Translation (NAT) routers.

Some time ago, it was an acceptable solution for the receiver to allocate a buffer the size of the original IP packet and put fragments there as they arrived. In fact, the receiver might just allocate a buffer of the maximum possible size, because the size of the original IP packet was known only after receiving the last fragment. That simple approach is now avoided because it wastes memory, and a malicious attack could bring a router to its knees just by sending a burst of very small fragments that lie about their original size.

Because every IP packet can be fragmented, and because each fragment can be further fragmented along the path for the same reason, there must be a way for the receiver to understand which IP packet each fragment belongs to, and at what position inside the original IP packet each fragment should be placed. The receiver must also be told the original size of the IP packet to know when it has received all of the fragments.

Several other aspects have to be considered to accomplish fragmentation. When copying the IP header of the original packet into its fragments, the kernel does not copy all of the options, but only those with the copied field set, as described earlier in the section "IP Options." However, when the IP fragments are merged, the resulting IP packet will look like the original one and therefore include all the options again.

Moreover, the IP checksum covers only the IP header (the payload is usually covered by the higher-layer protocols). When fragments are created, the headers are all different, so a checksum has to be computed for each one of them, and checked on the receiving side.

Effect of Fragmentation on Higher Layers

Fragmenting and defragmenting a packet takes both CPU time and memory. For a heavily loaded server, the extra resources involved may be quite significant. Fragmentation also introduces overhead in the bandwidth used for transmission, because each fragment has to contain both the L2 and L3 headers. If the size of the fragments is small, that overhead can be significant.

Higher layers are theoretically unaware of when the L3 layer chooses to fragment a packet.[*]

However, even if TCP and UDP are unaware of the fragmentation/defragmentation processes, the applications built on top of those two protocols are not. Some have to worry about fragmentation for performance reasons. Fragmentation/defragmentation is theoretically a transparent process, but it can have negative effects on performance because it always adds extra delay. A typical application that is very sensitive to delays, and that therefore tries to avoid fragmentation as much as possible, is a videoconferencing system. If you have ever tried one, or even if you have ever had an international phone call, you know what it means to have too big of a delay: conversing becomes very difficult. Some sources of delay cannot be avoided (such as network congestion, in the absence of robust QoS), but if something can be done to reduce that delay, the applications will take extraordinary steps to do it. Many applications are smart enough to try to avoid fragmentation by taking a few factors into consideration:

  • The kernel, first of all, does not have to simply use the MTU of the egress interface, but can also use a feature called path MTU discovery to discover the largest packet size it can use while avoiding fragmentation along a particular path (see the section "Path MTU Discovery").

  • The MTU can be set to a fairly safe, small value of 576. This reflects the specification in RFC 791 that each host must be prepared to accept packets of up to 576 octets. This restriction on packet size thus drastically reduces the likelihood of fragmentation. Many applications end up using that MTU by default, if not explicitly configured to use a different value.

When a sender decides to use a packet size smaller than its available MTU just to avoid fragmentation, it must also entail the same overhead of including extra headers that fragmentation requires. However, avoiding fragmentation by routers along the way reduces processing considerably along the route and therefore can be critical for improving response time.

IP Header Fields Used by Fragmentation/Defragmentation

Here are the fields of the IP header that are used to handle the fragmentation/defragmentation process. We will see how they are used in Chapter 22.

DF (Don't Fragment)

There are cases where fragmentation may be bad for the upper layers. For instance, interactive, streaming multimedia can produce terrible performance if it is fragmented. And sometimes, the transmitter knows that the receiver has a simple, lightweight IP protocol implementation and therefore cannot handle defragmentation. For such purposes, a field is provided in the IP packet header to say whether fragmentation is allowed. If the packet exceeds the MTU of some link along the path, it is dropped. The section "Path MTU Discovery" shows a use for this flag associated with path MTU discovery.

MF (More Fragments)

When a node fragments a packet, it sets this flag to TRUE in each fragment except the last. The recipient knows the size of the original, unfragmented packet when it receives the last fragment created from this packet, even if some fragments have not been received yet.

Fragment Offset

This represents the offset within the original IP packet to place the fragment. It is a 13-bit field. Since len is a 16-bit field, fragments always have to be created on 8-byte boundaries and the value of this field is read as a multiple of 8 bytes (that is, shifted left 3 bits). An offset of 0 indicates that this fragment is the first within the packet; that information is important because the first fragment contains header information related to the entire original packet.

ID

IP packet ID, which is the same for all fragments of an IP packet. It is thanks to this parameter that the receiver knows what fragments should be rejoined. We will see how the value of this field is chosen in the section "Long-Living IP Peer Information" in Chapter 23. Linux stores the last ID used in a structure named inet_peer where it stores information about the remote hosts with whom it is communicating.

Examples of Problems with Fragmentation/Defragmentation

Fragmentation is a pretty simple process: the node simply has to choose the right value to fit the MTU. It should not come as a surprise that most of the issues have to do with defragmentation. In the next two sections, we cover two of the most common issues: handling retransmissions and reassembling packets properly, along with the special problem of Network Address Translation (NAT).

Another reason not to use fragmentation is that it is incompatible with congestion control algorithms.

Retransmissions

I said earlier that an IP packet cannot be delivered to the next-higher layer until it has been completely defragmented. However, this does not mean that fragments are kept in the host's memory indefinitely. Otherwise, it would be very easy to render a host unusable through a simple Denial of Service (DoS) attack. A fragment might not be received for several reasons: for instance, it might be dropped along the way by a router that has run out of memory to store it due to congestion, it might become corrupted and be discarded due to the CRC (error check), or it could be held up by a firewall because the firewall wants to view the header in the first fragment before forwarding any fragments. Therefore, each router and host has a timer that cleans up the resources used by the fragments of an IP packet if some fragments are not received within a given amount of time.

If a sender could tell that a fragment was lost or dropped along the path, it would be nice if the sender could retransmit just the missing fragment. This is completely unfeasible to implement, though. A sender cannot know even whether its packet was fragmented by a router later on in the path, much less what the fragments are. So each sender must simply wait for a higher layer to tell it to resend an entire packet.

A retransmitted packet does not reuse the same ID as the original. However, it is still possible for a host to receive copies of the same IP fragment with the same packet ID, so a host must be able to handle this situation. Note that the same fragment may be received multiple times even without retransmissions: a common example is when there's a loop at the L2 layer. We saw this case in Part IV. This waste provides another good reason to avoid fragmentation at the source and to try to use packet sizes that minimize the likelihood of fragmentation along the way if delays are bad for the application (e.g., in videoconferencing software).

Since the kernel cannot swap its data out to disk (it swaps only user-space data), the memory waste due to handling fragments has a heavy impact on router performance. Linux puts a limit on the amount of memory usable by fragments, as described in the section "Tuning via /proc Filesystem" in Chapter 23.

Since IP is a connectionless protocol, there is no flow control and it is up to the upper-layer protocols (or the applications) to take care of losses. Some applications, of course, do not care much about the loss of data, and others do.

Let's suppose the upper layer detects the loss of some data by some means (for instance, with a timer that expires due to the lack of acknowledgment) and tries a retransmission. Since it is not possible to selectively resend only the missing fragments, the L4 protocol has to retransmit the entire IP packet. Each retransmission can lead to some special conditions that have to be handled by the receiver side (and sometimes by intermediate routers as well when the latter implement some form of firewalling that requires packets to be defragmented). Here are some of them:

Overlapping

A fragment could contain some of the data that already arrived in a previous packet. Retransmitted packets have a different ID and therefore their fragments are not supposed to be mixed with the fragments of a previous transmission. However, a buggy operating system that does not use a different ID for retransmitted packets, or the wraparound problem I'll introduce in the next section, can make overlapping possible.

Duplicates

This can be considered a special case of overlapping, where the two fragments are identical. A fragment is considered a duplicate if it starts at the same offset and it has the same length. There is no check on the actual payload content. Unless you are in the middle of a security attack, there is no reason why payload content should change between retransmissions of the same packet. The L2 loop mentioned previously can also be a source of duplicates.

Reception once reassembly is already complete

In this case, the IP layer considers the fragment the first of a new IP packet. If all of the new fragments are not received, the IP layer will simply clean up the duplicates during its garbage collection process; otherwise, it re-creates the whole packet and it is the job of the upper-layer protocol to recognize the packet as a duplicate.

Things can get more complicated if you consider that fragments can get fragmented, too.

Associating fragments with their IP packets

Because fragments could arrive out of order, defragmentation is a complex process that requires each packet to be recognized and put in its proper place as it arrives. The insert, delete, and merge operations must be easy and quick.

To identify the IP packet a fragment belongs to, the kernel takes the following parameters into consideration:

  • Source and destination IP addresses

  • IP packet ID

  • L4 protocol

Unfortunately, it is possible for different packets to share all of these parameters. For instance, two different senders could happen to choose the same packet ID for packets that happen to arrive at the same time. One might suppose that the source IP addresses would distinguish the packets, but what if both hosts sat behind a NAT router that put its own IP address on the packets? There is no way the recipient IP layer can distinguish fragments under these conditions. You cannot count on the IP ID field either, because it is a 16-bit field and can therefore wrap around pretty quickly on a fast network.

Since the IP ID field plays a central role in the defragmentation process, let's see how IP fragments are organized in memory and how the IP IDs are generated. The most obvious implementation of an IP ID generator would be one that increments a global counter and uses it as the ID each time the IP layer is asked to send a packet. This would assure sequential IDs and easy implementation. This simple model, however, has some problems:

  • For all possible higher-layer protocols to share a global ID, some sort of locking mechanism would be required (especially in multiprocessor machines) to prevent race conditions. However, the use of such a lock would limit symmetric multiprocessing (SMP) scalability.

  • IDs would be predictable, which would lead to some well-known methods of attacking a machine.

  • The ID value could wrap around quickly and lead to duplicate IDs. Because the ID field is a 16-bit value, allowing a total of 65,535 unique numbers, nodes with high traffic and fast connections might find themselves reusing the same ID for a new packet before the old one has reached its destination. For instance, with an average packet size of 512 bytes, a gigabit interface would send 65,535 packets in half a second. A highly loaded server could easily wrap around a global IP ID counter in less than 1 second!

Thus, we have to accept the likelihood that the IP layer occasionally mixes together data from completely different packets without noticing that anything is wrong. Only the higher layers can catch the problem, usually through error checking.

The following section shows one way in which Linux reduces the likelihood of (but does not solve) the wraparound problem and ID prediction. The section "Selecting the IP Header's ID Field" in Chapter 23 shows the precise algorithm and code.

Example of IP ID generation

The wraparound problem is partially addressed by means of multiple, concurrent, global counters. Instead of a global IP ID, the Linux kernel keeps a different one for each destination IP address (up to the maximum number of possible IP destinations). Note that by using multiple IP IDs, you make the IDs take a little longer to wrap around, but eventually they will do so anyway.

Figure 18-11 shows an example. Let's suppose we have traffic addressed to two servers with addresses IP1 and IP2. Let's suppose also that for each IP address we have different independent streams of traffic, such as HTTP, Telnet, and FTP. Because the IP IDs are shared by all the streams of traffic going to the same destination, the packets will have sequential IDs if you look at traffic to the destination as a whole, but the traffic of each application will not have sequential IDs. For instance, the IP packets to destination IP1 that are generated by a Telnet session are not sequential. Note that this is merely the solution chosen by Linux, and is not a standard. Other alternatives are available.

Example of unsolvable defragmentation problem: NAT

尽管IP层有各种巧妙的方法,但分片规则会导致IP层无法解决的潜在情况。图 18-12显示了其中之一。假设 R 是一个为其网络上的所有主机执行 NAT 的路由器。更准确地说,我们假设 R 进行了伪装:[ * ]内部网络中的主机生成的发往 Internet 的 IP 数据包标头中的源 IP 地址被替换为路由器 R 的 IP 地址 140.105.1.1。[ ]

Despite all manner of cleverness at the IP layer, the rules of fragmentation lead to potential situations that the IP layer cannot solve. Figure 18-12 shows one of them. Let's suppose that R is a router doing NAT for all the hosts on its network. To be more precise, let's suppose R did masquerading:[*] the source IP addresses in the headers of the IP packets generated by the hosts in the internal network and addressed to the Internet are replaced with router R's IP address, 140.105.1.1.

我们还假设 PC1 和 PC2 都需要向同一目标服务器 S 发送一些流量。如果碰巧几乎同时传输的两个数据包具有相同的 IP ID(在本例中为 1,000),会发生什么情况?由于路由器 R 将源 IP 地址 10.0.0.2 和 10.0.0.3 重写为 140.105.1.1,服务器 S 会认为它收到的两个 IP 数据包都来自路由器 R。在没有分片的情况下,这不是问题,因为 L4 信息(例如端口号)区分了两个源。事实上,这正是 NAT 可用的首要原因。当 R 传输的两个 IP 数据包在到达服务器 S 之前被分片时,问题就出现了。在这种情况下,服务器 S 收到具有相同源和目标 IP 地址(140.105.1.1、151.41.21.194)以及相同 IP ID(1,000)的分片,因此会尝试将它们重组在一起,并可能混合两个不同 IP 数据包的分片。结果,两个数据包都会因被认为已损坏而被丢弃。在最坏的情况下,两个数据包可能具有相同的长度,重叠可能只损坏有效载荷而不损坏 L4 标头。IP 校验和仅覆盖 IP 标头,因此无法检测到这种情况。根据应用的不同,后果可能很严重。

Let's also suppose that both PC1 and PC2 need to send some traffic to the same destination server S. What would happen if, by chance, two packets transmitted at more or less the same time had the same IP ID (in this example, 1,000)? Since the router R rewrites the source IP address changing 10.0.0.2 and 10.0.0.3 into 140.105.1.1, server S will think that the two IP packets it received both came from router R. In the absence of fragmentation, this is not a problem because the L4 information (for instance, the port number) distinguishes the two sources. In fact, that is what makes NAT usable in the first place. The problem arises when the two IP packets transmitted by R get fragmented before arriving at server S. In this case, server S receives fragments with the same source and destination IP address (140.105.1.1, 151.41.21.194) and the same IP ID (1,000), and therefore tries to put them together and potentially mixes the fragments of two different IP packets. As a consequence of this, both of the packets will be discarded because they are considered corrupted. In the very worst case, the two packets could have the same length and the overlapping could corrupt the payload without corrupting the L4 headers. The IP checksum covers only the IP header and therefore cannot detect this condition. Depending on the application, the consequences could be serious.
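The collision can be made concrete with a sketch of the tuple a receiver uses to group fragments for reassembly (per RFC 791: source address, destination address, protocol, and identification). The struct and function names here are illustrative, not the kernel's:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative reassembly key: fragments sharing all four fields are
 * placed in the same reassembly queue by the receiver. */
struct frag_key {
    uint32_t saddr, daddr;
    uint8_t  proto;
    uint16_t id;
};

/* Two fragments end up in the same queue when every field matches. */
static bool same_reassembly_queue(struct frag_key a, struct frag_key b)
{
    return a.saddr == b.saddr && a.daddr == b.daddr &&
           a.proto == b.proto && a.id == b.id;
}
```

After masquerading, the packets from 10.0.0.2 and 10.0.0.3 both carry saddr 140.105.1.1; if they also happen to share protocol and IP ID, their fragments collide in one queue at server S, exactly the failure described above.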

接收非连续 IP 标头 ID 的并发应用程序

图 18-11。接收非连续 IP 标头 ID 的并发应用程序

Figure 18-11. Concurrent applications receiving nonconsecutive IP header IDs

列举了分片带来的所有问题之后,我们可以更好地理解为什么 IPv6 协议的设计者决定只允许在始发主机上进行 IP 分片,而不允许在路由器等中间主机上进行分片。

After this enumeration of the problems with fragmentation, we can better understand why the designers of the IPv6 protocol decided to allow IP fragmentation only at the originating hosts, and not at intermediate hosts such as routers.

NAT 和 IP 分段可能带来麻烦的示例

图 18-12。NAT 和 IP 分段可能带来麻烦的示例

Figure 18-12. Example where NAT and IP fragmentation could give trouble

路径 MTU 发现

Path MTU Discovery

经过对数据包分段的陷阱的长时间讨论后,读者可以很好地理解我将讨论的下一个 IP 层功能,即路径 MTU 发现。

After the long discussion of the pitfalls of packet fragmentation, readers can well appreciate the next IP layer feature I'll discuss, path MTU discovery.

当我在第 2 章描述 net_device 数据结构时,我列出了最常见接口类型的 MTU。MTU 的作用范围是网络接口所连接的 LAN。如果您将 IP 数据包传输到与您用于传输的接口位于同一 LAN 上的另一台主机,并且数据包的大小大于该 LAN 的 MTU,则必须对 IP 数据包进行分片。但是,如果您选择适合 MTU 的大小,则可以确保不需要分片。当目标主机不在直接连接的 LAN 上时,您无法依靠 LAN 的 MTU 来推断是否会发生分片。这就是路径 MTU 发现的用武之地。

When I described the net_device data structure in Chapter 2, I listed the MTUs of the most common interface types. The scope of the MTU is the LAN that the network interface is connected to. If you transmit an IP packet to another host on the same LAN as the interface you use to transmit, and the size of the packet is bigger than the LAN's MTU, the IP packet will have to be fragmented. However, if you choose a size that fits the MTU, you can ensure that no fragmentation will be required. When the destination host is not on a directly attached LAN, you cannot count on the LAN's MTU to derive whether fragmentation will take place. Here is where path MTU discovery comes in.

路径 MTU 发现用于发现传输到给定目标地址的数据包在不被分段的情况下可以具有的最大大小。该参数称为路径 MTU (PMTU)。基本上,PMTU 是从一台主机到另一台主机的路由上的所有连接中遇到的最小 MTU。

Path MTU discovery is used to discover the biggest size a packet transmitted to a given destination address can have without being fragmented. That parameter is called the Path MTU (PMTU). Basically, the PMTU is the smallest MTU encountered among all the links along the route from one host to the other.

由于两个端点之间的路径可能是不对称的,因此对于任何给定的一对主机,可以有两个不同的 PMTU。每台主机都会计算并使用适合向另一台主机发送数据包的那个 PMTU。此外,路由的改变可能导致 PMTU 的改变。

Since the path between two endpoints can be asymmetric, it follows that there can be two different PMTUs for any given pair of hosts. Each host computes and uses the one appropriate for sending packets to the other. Furthermore, a change of route can lead to a change of PMTU.

由于每个目标 IP 地址可以使用不同的 PMTU,因此它被缓存在关联的路由表缓存条目中。我们在第七部分会看到,路由表中的路由可以聚合多个IP地址;例如,您可以有一条路由,表明可以通过网关 10.0.2.1 访问网络 10.0.1.0/24。另一方面,路由表缓存对于主机最近与之通信的每个目标 IP 地址都有一个条目。[ * ]因此,您可能有一个主机 10.0.1.2 的条目和另一个主机 10.0.1.3 的条目,即使它们是通过同一网关访问的。这些条目中的每一个都包含一个唯一的 PMTU。您可能会反对,如果这两个地址属于同一 LAN 内的两个主机,则第三个主机可能会使用相同的路由来到达这两个主机,因此共享相同的 PMTU。在路由表中只保留一个 PMTU 是有意义的。不幸的是这是不可能的。仅仅因为一条路由用于到达一堆地址并不一定意味着它们属于同一个 LAN。路由是一个复杂的主题,我们将在第七部分中介绍它的几个方面。

Since each destination IP address can use a different PMTU, it is cached in the associated routing table cache entry. We will see in Part VII that the routes in the routing table can aggregate several IP addresses; for instance, you can have a route that says that network 10.0.1.0/24 is reachable via gateway 10.0.2.1. The routing table cache, on the other hand, has one single entry for each destination IP address the host has been talking to in the recent past.[*] You may therefore have an entry for host 10.0.1.2 and another one for 10.0.1.3, even though they are reached through the same gateway. Each of those entries includes a unique PMTU. You may object that, if those two addresses belong to two hosts within the same LAN, a third host would probably use the same route to reach both hosts and therefore share the same PMTU. It would make sense to keep just one PMTU in the routing table. This is unfortunately not possible. Just because one route is used to reach a bunch of addresses does not necessarily mean that they belong to the same LAN. Routing is a complex subject, and we will cover several aspects of it in Part VII.

每个路由表条目都与一个出口设备关联:即用于将流量传输到路由上下一跳的设备。如果设备直接连接到其通信对端并且启用了 PMTU 发现,则 PMTU 默认设置为出口设备的 MTU。

Each routing table entry is associated with an egress device: the device to use to transmit traffic to the next hop along the route. If the device is directly connected to its correspondent and PMTU discovery is enabled, the PMTU is set by default to the MTU of the egress device.

直接连接的设备包括电信电缆的两个端点或以太网 LAN 上的设备。对于 LAN 上的所有设备(它们之间没有路由器)共享相同的 MTU 以确保正常运行尤其重要。

Directly connected devices include the two endpoints of a telecom cable or devices on an Ethernet LAN. It's particularly important for all devices on the LAN (with no router between them) to share the same MTU for proper operation.

如果设备未直接连接(即,如果设备之间至少有一个路由器),或者禁用了 PMTU 发现,则 PMTU 默认设置为 576。这不是一个随机值,而是在最初的 IP RFC 791 中定义的值。无论默认值如何,管理员都可以通过用户空间配置程序(例如 ifconfig)设置初始 PMTU。

If devices are not directly connected—that is, if at least one router lies between them—or if PMTU discovery is disabled, the PMTU by default is set to 576. This is not a random value, but is defined in the original IP RFC, RFC 791. Regardless of the default, an administrator can set the initial PMTU through a user-space configuration program such as ifconfig.

让我们看看 PMTU 发现是如何工作的。该算法只是利用 IP 标头的字段来处理分段/碎片整理以及相关的 ICMP 消息。

Let's see how PMTU discovery works. The algorithm simply takes advantage of the IP header's fields used to handle fragmentation/defragmentation and the associated ICMP messages.

如果您传输一个在标头中设置了 DF 标志的 IP 数据包并且没有人抱怨,则意味着在到目的地的路径上没有发生碎片,并且您使用的 PMTU 没有问题。这并不意味着您使用的是最佳尺寸。您很可能能够增加 PMTU 并且仍然不会出现碎片。一个简单的例子是两个以太网 LAN 通过路由器连接。在网络两侧,MTU 均为 1,500,但每个 LAN 的主机使用 MTU 576 与另一个 LAN 的主机通信,因为它们不是直接连接的。这不是最佳的。

If you transmit an IP packet with the DF flag set in the header and no one complains, it means that no fragmentation has taken place along the path to the destination, and that the PMTU you used is fine. This does not mean you are using the optimal size. You might well be able to increase the PMTU and still not have fragmentation. A simple example is where two Ethernet LANs are connected by a router. On both sides of the network, the MTU is 1,500, but hosts of each LAN use the MTU of 576 to talk to the hosts of the other LAN because they are not directly connected. This is not optimal.

如果将探测中数据包的大小增加到最佳大小,则当您跨越实际 PMTU 时,您将收到 ICMP 消息通知。ICMP 消息将包含发出投诉的设备的 MTU,以便内核可以相应地更新本地 PMTU。

If you increase the size of the packets in a probe to their optimal size, you will be notified with an ICMP message when you cross the real PMTU. The ICMP message will include the MTU of the device that complained so that the kernel can update the local PMTU accordingly.
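The sender's reaction to such an ICMP message can be sketched as a small pure function. This is a simplified model, assuming the names `update_pmtu` and `IPV4_MIN_MTU` for illustration; the kernel actually stores and updates the PMTU in the routing cache entry for the destination:

```c
#include <stdint.h>

/* Floor required by RFC 1191, section 3.0: the PMTU may never drop
 * below 68 bytes. */
#define IPV4_MIN_MTU 68

/* Return the new cached PMTU after an ICMP FRAGMENTATION NEEDED message
 * advertising the next-hop MTU 'reported'. The algorithm only ever
 * shrinks the PMTU; growing back happens through cache-entry expiry. */
uint32_t update_pmtu(uint32_t current, uint32_t reported)
{
    if (reported < IPV4_MIN_MTU)
        reported = IPV4_MIN_MTU;    /* clamp (and, in Linux, lock) */
    return reported < current ? reported : current;
}
```

For example, a host probing with 1,500-byte packets that receives a report of a 1,006-byte link would shrink its cached PMTU to 1,006, while a bogus report below 68 bytes would be clamped.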

可以将 Linux 配置为通过以下方式之一处理路径 MTU 发现:

Linux can be configured to handle path MTU discovery in one of the following ways:

IP_PMTUDISC_DONT
IP_PMTUDISC_DONT

切勿发送标头中设置了 DF 标志的 IP 数据包;因此,不要使用路径 MTU 发现。

Never send IP packets with the DF flag set in the header; therefore, do not use path MTU discovery.

IP_PMTUDISC_DO
IP_PMTUDISC_DO

始终在本地节点上生成的数据包(而不是转发的数据包)的标头中设置 DF 标志,以尝试为每次传输找到最佳的 PMTU。

Always set the DF flag in the header of packets generated on the local node (not forwarded ones), in an attempt to find the best PMTU for every transmission.

IP_PMTUDISC_WANT
IP_PMTUDISC_WANT

决定是否基于每个路由使用路径 MTU 发现。这是默认设置。

Decide whether to use path MTU discovery on a per-route basis. This is the default.
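From user space, these policies are selected per socket with the `IP_MTU_DISCOVER` socket option, as documented in ip(7). The following is a minimal sketch; the helper name `set_pmtu_policy` is made up for illustration, while the option and constant names are the real Linux ones:

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Set the PMTU discovery policy (IP_PMTUDISC_DONT, IP_PMTUDISC_WANT, or
 * IP_PMTUDISC_DO) on a fresh UDP socket and return the value the kernel
 * reports back, or -1 on error. */
int set_pmtu_policy(int policy)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    int val = policy, out = -1;
    socklen_t len = sizeof(out);
    if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val)) == 0)
        getsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &out, &len);

    close(fd);
    return out;
}
```

An application that wants the DF flag on every locally generated packet would call `set_pmtu_policy(IP_PMTUDISC_DO)`; leaving the socket untouched keeps the per-route default, IP_PMTUDISC_WANT.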

当启用路径 MTU 发现时,与路由关联的 PMTU 可能随时因路径中出现最大传输尺寸更小的路由器而改变,导致源端收到 ICMP FRAGMENTATION NEEDED 消息(请参阅第 25 章中关于 icmp_unreach 的讨论)。在这种情况下,路由缓存中具有相同目的地的所有条目的 PMTU 都会被更新。[*] 有关路由表如何处理收到的 ICMP FRAGMENTATION NEEDED 消息的详细信息,请参阅第 33 章的"到期标准"一节。应该注意的是,该算法只会缩小 PMTU,而不会增大它。然而,其 PMTU 源自入口 ICMP FRAGMENTATION NEEDED 消息的路由缓存条目会在一段时间后过期,这相当于回到(更大的)默认 PMTU。更多详细信息请参阅刚才引用的同一节。

When path MTU discovery is enabled, the PMTU associated with a route can change at any time to include routers with a smaller maximum size, resulting in the source receiving an ICMP FRAGMENTATION NEEDED message (see the discussion of icmp_unreach in Chapter 25). In this case, the PMTU is updated for all the entries in the routing cache with the same destination.[*] Refer to the section "Expiration Criteria" in Chapter 33 for details on how the reception of the ICMP FRAGMENTATION NEEDED message is handled by the routing table. It should be noted that the algorithm always shrinks the PMTU; it never increases it. However, the entries of the routing cache whose PMTU is derived from an ingress ICMP FRAGMENTATION NEEDED message expire after some time, which is equivalent to going back to the (bigger) default PMTU. See the same section just referenced for more details.

也可以在通过 ip route 命令添加路由时手动设置路由的 PMTU。

The PMTU of a route can also be set manually when adding the route through the ip route command.

即使启用了路径 MTU 发现,仍然可以锁定当前 PMTU,使其不被更改。这主要发生在两种情况下:

Even if path MTU discovery was enabled, it is still possible to lock the current PMTU so that it will not be changed. This happens in two main cases:

  • 当使用 ip route 设置 PMTU 时,可以使用 lock 关键字来锁定它。以下示例添加一条经由下一跳网关 100.100.100.1 到 10.10.1.0/24 网络的路由,并将 PMTU 锁定为 750 字节:

    ip route add 10.10.1.0/24 via 100.100.100.1 mtu lock 750
  • When using ip route to set the PMTU, it is possible to lock it with the lock keyword. The following example adds a route to the 10.10.1.0/24 network via the next hop gateway 100.100.100.1 and locks the PMTU to 750 bytes:

    ip route add 10.10.1.0/24 via 100.100.100.1 mtu lock 750
  • 如果由于收到 ICMP FRAGMENTATION NEEDED 消息而应使用的 PMTU 小于允许的最小值,则 PMTU 将设置为该最小值并锁定。最小值可以通过 /proc/sys/net/ipv4/route/min_pmtu文件进行配置(参见第 36 章中的“ /proc/sys/net/ipv4/route 目录”部分)。在任何情况下,PMTU 都不能设置为低于 68 的值,如 RFC 1191 第 3.0 节(以及 RFC 791“分段和重组”节间接要求)的要求。另请参阅第 33 章中的“到期标准”部分。

  • If the PMTU you are supposed to use as a consequence of a received ICMP FRAGMENTATION NEEDED message is smaller than the minimum allowed value, the PMTU is set to that minimum value, and locked. The minimum value can be configured with the /proc/sys/net/ipv4/route/min_pmtu file (see the section "The /proc/sys/net/ipv4/route Directory" in Chapter 36). In any case, the PMTU cannot be set to a value lower than 68, as requested by RFC 1191, section 3.0 (and indirectly by RFC 791, section "Fragmentation and reassembly"). See also the section "Expiration Criteria" in Chapter 33.

在 Linux 中,ip_dont_fragment 函数(见第 22 章)根据此处描述的注意事项来决定当数据包超过 PMTU 时是否应该对其进行分片。

In Linux, the ip_dont_fragment function (shown in Chapter 22) uses the considerations described here to decide whether a packet should be fragmented when it exceeds the PMTU.

给定传输的 PMTU 值还会受到以下因素的影响:

The value of the PMTU on a given transmission can also be influenced by the following factors:

  • 设备的 MTU 是否从用户空间显式配置

  • Whether the device's MTU is explicitly configured from user space

  • 应用程序是否更改了给定 TCP 套接字上使用的最大段大小 (mss)

  • Whether the application has changed the maximum segment size (mss) to use on a given TCP socket

校验和

Checksums

校验和是网络协议用来识别传输错误的冗余字段。一些校验和不仅可以检测错误,还可以自动修复某些类型的错误。

A checksum is a redundant field used by network protocols to recognize transmission errors. Some checksums can not only detect errors, but also automatically fix errors of certain types.

校验和背后的想法很简单。在传输数据包之前,发送方计算一个小的、固定长度的字段(校验和),其中包含数据的某种哈希值。如果数据的某些位在传输过程中发生变化,则损坏的数据很可能会产生不同的校验和。校验和提供的可靠性级别取决于用来生成它的函数。IP 协议使用的校验和是一种涉及求和与反码的简单校验和,其强度太弱,不能被认为是可靠的。为了进行更可靠的健全性检查,您必须依赖 L2 CRC 或 SSL/IPSec 消息认证码(MAC)。

The idea behind a checksum is simple. Before transmitting a packet, the sender computes a small, fixed-length field (the checksum) containing a sort of hash of the data. If a few bits of the data were to change during transit, it is likely that the corrupted data would produce a different checksum. Depending on what function you used to produce the checksum, it provides different levels of reliability. The checksum used by the IP protocol is a simple one involving sums and one's complements, which is too weak to be considered reliable. For a more reliable sanity check, you must rely on L2 CRCs or SSL/IPSec Message Authentication Codes (MACs).

不同的协议可以使用不同的校验和算法。IP 协议校验和仅涵盖 IP 标头。大多数 L4 协议的校验和涵盖其标头和数据。

Different protocols can use different checksum algorithms. The IP protocol checksum covers only the IP header. Most L4 protocols' checksums cover both their header and the data.

在 L2(例如以太网)有一个校验和,在 L3(例如 IP)有另一个,在 L4(例如 TCP)又有另一个,这看起来似乎是多余的,因为它们通常都覆盖数据的重叠部分,但这些检查是有价值的。错误不仅可能发生在传输过程中,也可能发生在层与层之间移动数据时。此外,每个协议都负责确保自身的正确传输,并且不能假设其上层或下层承担该任务。

It may seem redundant to have a checksum at L2 (e.g., Ethernet), another one at L3 (e.g., IP), and another one at L4 (e.g., TCP), because they often all apply to overlapping portions of data, but the checks are valuable. Errors can occur not only during transmission, but also while moving data between layers. Moreover, each protocol is responsible for ensuring its own correct transmission, and cannot assume that layers above or below it take on that task.

作为可能出现的复杂场景的示例,假设 LAN1 中的 PC A 通过 Internet 将数据发送到 LAN2 中的 PC B。我们还假设 LAN1 中使用的 L2 协议使用校验和,但 LAN2 中使用的 L2 协议不使用校验和。对于至少一个较高层来说,提供某种形式的校验和以减少接受损坏数据的可能性非常重要。

As an example of the complex scenarios that can arise, imagine that PC A in LAN1 sends data over the Internet to PC B in LAN2. Let's also suppose that the L2 protocol used in LAN1 uses a checksum but that the one on LAN2 doesn't. It's important for at least one higher layer to provide some form of checksum to reduce the likelihood of accepting corrupted data.

建议在每个协议定义中使用校验和,但这不是必需的。然而,我们必须承认,相关协议的更好设计可以消除不同层协议中重叠的功能所带来的一些开销。由于大多数 L2 和 L4 协议都提供校验和,因此在 L3 上也提供校验和并不是绝对必要的。正是由于这个原因,校验和已从 IPv6 中删除。

The use of a checksum is recommended in every protocol definition, although it is not required. Nevertheless, one has to admit that a better design of related protocols could remove some of the overhead imposed by features that overlap in the protocols at different layers. Because most L2 and L4 protocols provide checksums, having it at L3 as well is not strictly necessary. For exactly this reason, the checksum has been removed from IPv6.

在 IPv4 中,IP 校验和是一个 16 位字段,涵盖整个 IP 标头(包括选项)。校验和首先由数据包的源计算,然后逐跳更新直至到达目的地,以反映每个路由器应用的标头的更改。在更新校验和之前,每一跳首先必须通过将数据包中包含的校验和与本地计算的校验和进行比较来检查数据包的完整性。如果健全性检查失败,则数据包将被丢弃,但不会生成 ICMP:L4 协议将处理它(例如,如果在给定时间内未收到确认,将使用计时器强制重传)。

In IPv4, the IP checksum is a 16-bit field that covers the entire IP header, options included. The checksum is first computed by the source of the packet, and is updated hop by hop all the way to its destination to reflect changes to the header applied by each router. Before updating the checksum, each hop first has to check the sanity of the packet by comparing the checksum included in the packet with the one computed locally. A packet is discarded if the sanity check fails, but no ICMP is generated: the L4 protocol will take care of it (for example, with a timer that will force a retransmission if no acknowledgment is received within a given amount of time).

以下是一些触发需要更新校验和的情况:

Here are some cases that trigger the need to update the checksum:

减少 TTL
Decrementing the TTL

路由器在转发数据包之前必须递减其 IP 标头中的 TTL。由于 IP 校验和也覆盖了该字段,因此原始校验和不再有效。您将在第 20 章的"ip_forward 函数"一节中看到,TTL 是通过 ip_decrease_ttl 递减的,该函数也会同时处理校验和。

A router has to decrement a packet's TTL in its IP header before forwarding it. Since the IP checksum also covers that field, the original checksum is no longer valid. You will see in the section "ip_forward Function" in Chapter 20 that the TTL is decreased with ip_decrease_ttl, which takes care of the checksum, too.

数据包修改(包括 NAT)
Packet mangling (including NAT)

所有涉及一个或多个 IP 标头字段更改的功能都会强制重新计算校验和。NAT 可能是最著名的例子。

All of those features that involve the change of one or more of the IP header fields force the checksum to be recomputed. NAT is probably the best-known example.

IP 选项处理
IP options handling

由于选项是标头的一部分,因此它们被校验和覆盖。因此,每当以需要添加或修改 IP 标头的方式处理它们(例如添加时间戳)时,都会强制重新计算校验和。

Since the options are part of the header, they are covered by the checksum. Therefore, any processing that adds to or modifies the IP header (e.g., the addition of a timestamp) forces the recomputation of the checksum.

碎片化
Fragmentation

当数据包被分片时,每个分片都有不同的标头。大多数字段保持不变,但与分片相关的字段(例如偏移量)有所不同。因此,必须重新计算校验和。

When a packet is fragmented, each fragment has a different header. Most of the fields remain unchanged, but the ones that have to do with fragmentation, such as offset, are different. Therefore, the checksum has to be recomputed.

由于 IP 协议使用的校验和与 TCP、UDP 和 ICMP 使用的是同一种简单算法,因此已编写了一组供所有这些协议使用的通用函数。还有一个针对 IP 校验和优化的专门函数。根据 IP 校验和算法的定义,标头被分割成 16 位字,对它们求和后取反码。为简单起见,图 18-13 显示了仅对两个 16 位字进行校验和计算的示例。Linux 并不对 16 位字求和,而是对 32 位字甚至 64 位长整数求和,这会带来更快的计算(这需要在求和与取反码的计算之间增加一个额外的步骤;请参阅下一节中对 csum_fold 的描述)。实现该算法的函数称为 ip_fast_csum,在大多数体系结构上直接用汇编语言编写。

Since the checksum used by the IP protocol is computed using the same simple algorithm that is used by TCP, UDP, and ICMP, a general set of functions has been written to be used by all of them. There is also a specialized function optimized for the IP checksum. According to the definition of the IP checksum algorithm, the header is split into 16-bit words that are summed and ones-complemented. Figure 18-13 shows an example of checksum computation on only two 16-bit words for simplicity. Linux does not sum 16-bit words, but it does sum 32-bit words and even 64-bit longs, which results in faster computation (this requires an extra step between the computation of the sum and its one's complement; see the description of csum_fold in the next section). The function that implements the algorithm, called ip_fast_csum, is written directly in Assembly language on most architectures.

IP校验和计算

图 18-13。IP校验和计算

Figure 18-13. IP checksum computation
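The algorithm in Figure 18-13 can be written portably in a few lines of C. This is a didactic sketch, not the kernel's `ip_fast_csum` (which does the same computation in per-architecture assembly); the function name `ip_checksum` is made up here:

```c
#include <stdint.h>
#include <stddef.h>

/* Internet checksum over an IP header: sum the 16-bit words into a
 * 32-bit accumulator, fold the carries back into the low 16 bits (the
 * job csum_fold does in the kernel), and take the one's complement.
 * ihl_words is the header length in 32-bit words, as in the IP header. */
uint16_t ip_checksum(const void *hdr, size_t ihl_words)
{
    const uint16_t *p = hdr;
    uint32_t sum = 0;

    for (size_t i = 0; i < ihl_words * 2; i++)
        sum += p[i];

    while (sum >> 16)                 /* fold the carries (end-around) */
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

The sketch also exhibits the verification property discussed later in this section: once the computed checksum is stored in the header, running the same function over the whole header (checksum field included) yields zero for an uncorrupted header.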

用于校验和计算的 API

APIs for Checksum Computation

L3 (IP) 校验和的计算速度比 L4 校验和快得多,因为它仅涵盖 IP 标头。因为它是一种廉价的操作,所以通常在软件中计算。

The L3 (IP) checksum is much faster to compute than the L4 checksum, because it covers only the IP header. Because it's a cheap operation, it is often computed in software.

用于计算校验和的通用函数集放置在每个体系结构的文件 include/asm-xxx/checksum.h 中。(例如,用于 i386 平台的是 include/asm-i386/checksum.h。)每个协议要么使用正确的输入参数直接调用通用函数,要么定义一个调用通用函数的包装器。当更改先前已计算过校验和的数据(例如 IP 标头)时,校验和算法允许协议简单地更新校验和,而不是从头开始重新计算它。

The set of general functions used to compute checksums are placed in the per-architecture files include/asm-xxx/checksum.h. (The one for the i386 platform, for instance, is include/asm-i386/checksum.h.) Each protocol calls the general function directly using the right input parameters, or defines a wrapper that calls the general functions. The checksumming algorithm allows a protocol to simply update a checksum, instead of recomputing it from scratch, when changing a previously checksummed piece of data such as the IP header.

此处显示了 checksum.h 中一个特定于 IP 的函数 ip_fast_csum 的原型。该函数将指向 IP 标头的指针(iph)及其长度(ihl)作为参数。后者可能会因 IP 选项而改变。返回值是校验和。此函数利用 IP 标头长度始终是 4 字节的倍数这一事实来简化某些处理。

The prototype for one IP-specific function in checksum.h, ip_fast_csum, is shown here. The function takes as parameters the pointer to the IP header (iph), and its length (ihl). The latter can change due to IP options. The return value is the checksum. This function takes advantage of the fact that the IP header is always a multiple of 4 bytes in length to streamline some of the processing.

static inline
unsigned short ip_fast_csum(unsigned char * iph, unsigned int ihl)

当计算要传输的数据包的 IP 标头的校验和时,应首先将 iphdr->check 的值清零,因为校验和不应反映校验和本身。在此算法中,由于它使用简单求和,零值字段实际上被排除在生成的校验和之外。这就是为什么在代码中的不同位置,您可以看到该字段在调用 ip_fast_csum 之前被清零。

When computing the checksum of an IP header on a packet to be transmitted, the value of iphdr->check should first be zeroed out because the checksum should not reflect the checksum itself. In this algorithm, because it uses simple summing, a zero-value field is effectively excluded from the resulting checksum. This is why in different places in the code you can see that this field is zeroed right before the call to ip_fast_csum.

校验和算法有一个有趣的属性,最初可能会让阅读数据包转发和接收源代码的人感到困惑。如果校验和正确,并且转发或接收节点在整个标头上运行该算法(保留原始的 iphdr->check 字段),则会得到零结果。如果您查看 ip_rcv 函数,您会发现这正是根据校验和验证输入数据包的方式。这种检查损坏的方法比更直观的先将 iphdr->check 字段清零再重新计算的方法更快。

The checksum algorithm has an interesting property that may initially confuse people who read the source code for packet forwarding and reception. If the checksum is correct, and the forwarding or receiving node runs the algorithm over the entire header (leaving the original iphdr->check field in place), a result of zero is obtained. If you look at the function ip_rcv, you can see that this is exactly how input packets are validated against the checksum. This way of checking for corruption is faster than the more intuitive way of zeroing out the iphdr->check field and recomputing.

以下是用于计算或更新 IP 校验和的主要函数:

Here are the main functions used to compute or update an IP checksum:

ip_compute_csum
ip_compute_csum

计算校验和的通用函数。它只是接收任意大小的缓冲区作为输入。

A general-purpose function that computes a checksum. It simply receives as input a buffer of an arbitrary size.

ip_fast_csum
ip_fast_csum

给定 IP 标头和长度,计算并返回 IP 校验和。它既可用于验证输入数据包,也可用于计算传出数据包的校验和。

您可以将 ip_fast_csum 视为 ip_compute_csum 的一个针对 IP 标头优化的变体。

Given an IP header and length, computes and returns the IP checksum. It can be used both to validate an input packet and to compute the checksum of an outgoing packet.

You can consider ip_fast_csum a variation of ip_compute_csum optimized for IP headers.

ip_send_check
ip_send_check

计算传出数据包的 IP 校验和。它是 ip_fast_csum 的一个简单包装,会预先将 iphdr->check 清零。

Computes the IP checksum of an outgoing packet. It is a simple wrapper to ip_fast_csum that zeros iphdr->check beforehand.

ip_decrease_ttl
ip_decrease_ttl

当更改 IP 标头的单个字段时,对 IP 校验和应用增量更新比从头开始计算更快。这要归功于用于计算校验和的简单算法。一个常见的示例是被转发的数据包,其 iphdr->ttl 字段因此被递减。ip_decrease_ttl 在 ip_forward 中被调用。

When changing a single field of an IP header, it is faster to apply an incremental update to the IP checksum than to compute it from scratch. This is possible thanks to the simple algorithm used to compute the checksum. A common example is a packet that is forwarded and therefore gets its iphdr->ttl field decremented. ip_decrease_ttl is called within ip_forward.
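The incremental update can be demonstrated with a small sketch in the style of the kernel's ip_decrease_ttl. The header layout here is simplified to a plain array of ten 16-bit words, with word 4 holding the TTL in its high byte (as in the IPv4 header) and word 5 holding the checksum; the names `decrease_ttl` and `csum` are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Full recompute, used only to check the incremental result. */
static uint16_t csum(const uint16_t *w, size_t n)
{
    uint32_t s = 0;
    while (n--)
        s += *w++;
    while (s >> 16)
        s = (s & 0xFFFF) + (s >> 16);
    return (uint16_t)~s;
}

/* Decrement the TTL and patch the checksum incrementally: the header
 * word shrank by 0x0100, so the complemented checksum grows by 0x0100,
 * with an end-around carry exactly as in the kernel's ip_decrease_ttl. */
void decrease_ttl(uint16_t hdr[10])
{
    uint32_t check = hdr[5];
    check += 0x0100;
    hdr[5] = (uint16_t)(check + (check >= 0xFFFF));  /* end-around carry */
    hdr[4] -= 0x0100;                                /* the TTL itself  */
}
```

The point of the sketch is that the patched checksum matches a full recomputation over the modified header, so a router forwarding millions of packets never has to rescan the whole header.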

前面提到的checksum.h文件中还有其他几个通用支持例程 ,但它们主要由 L4 协议使用。例如:

There are several other general support routines in the previously mentioned checksum.h file, but they are mostly used by L4 protocols. For instance:

skb_checksum
skb_checksum

它在net/core/skbuff.c中定义,是一个通用校验和函数,由多个包装器使用(包括前面列出的一些函数),并且主要由 L4 协议在特定情况下使用。

Defined in net/core/skbuff.c, it is a general-purpose checksumming function used by several wrappers (including some of the functions listed earlier), and used mostly by L4 protocols for specific situations.

csum_fold
csum_fold

将 32 位值的 16 个最高有效位折叠为 16 个最低有效位,然后对输出值求补。此操作通常是校验和计算的最后阶段。

Folds the 16 most-significant bits of a 32-bit value into the 16 least-significant bits and then complements the output value. This operation is normally the last stage of a checksum computation.

csum_partial[_ xxx ]
csum_partial[_ xxx ]

该函数系列计算的校验和缺少由 csum_fold 完成的最终折叠。L4 协议可以调用其中一个 csum_partial 函数来计算 L4 数据的校验和,然后调用诸如 csum_tcpudp_magic 之类的函数来计算伪标头(在下一节中描述)的校验和,最后将两个部分校验和相加并折叠结果。

csum_partial 及其一些变体在大多数体系结构上是用汇编语言编写的。

This family of functions computes a checksum that lacks the final folding done by csum_fold. L4 protocols can call one of the csum_partial functions to compute the checksum on the L4 data, then invoke a function such as csum_tcpudp_magic that computes the checksum on a pseudoheader (described in the following section), and finally sums the two partial checksums and folds the result.

csum_partial and some of its variations are written in assembly language on most architectures.

csum_block_add
csum_block_add

csum_block_sub
csum_block_sub

分别对两个校验和进行相加和相减。当增量计算一个数据块的校验和时,第一个函数很有用。当从已计算过校验和的数据中删除一段数据时,可能需要第二个函数。许多其他函数在内部使用这两个函数。

Sum and subtract two checksums, respectively. The first one is useful when the checksum over a block of data is computed incrementally. The second one might be needed when a piece of data is removed from one whose checksum had already been computed. Many of the other functions use these two internally.

skb_checksum_help
skb_checksum_help

该函数有两种不同的行为,具体取决于它传递的是入口 IP 数据包还是出口 IP 数据包。

在入口数据包上,它会使 L4 硬件校验和无效。

在出口数据包上,它计算 L4 校验和。例如,当无法使用出口设备的硬件校验和功能时(请参见第 11 章中的 dev_queue_xmit),或者当 L4 硬件校验和已失效因而需要重新计算时,就会用到它。校验和可能因 Netfilter 的 NAT 操作而失效,或者当 IPsec 套件的转换协议通过在原始 IP 标头和 L4 标头之间插入附加标头来改动 L4 有效载荷时失效。另请注意,如果设备在硬件中计算 L4 校验和并将其存储在 L4 标头中,它最终会修改 L3 有效载荷;当后者已被 IPsec 套件摘要或加密时,这是不允许的,因为这会使数据失效。

This function has two different behaviors, depending on whether it is passed an ingress IP packet or an egress IP packet.

On ingress packets, it invalidates the L4 hardware checksum.

On egress packets, it computes the L4 checksum. It is used, for example, when the hardware checksumming capabilities of the egress device cannot be used (see dev_queue_xmit in Chapter 11), or when the L4 hardware checksum has been invalidated and therefore needs to be recomputed. A checksum can be invalidated, for example, by a NAT operation from Netfilter, or when the transformation protocols of the IPsec suite mangle the L4 payload by inserting additional headers between the original IP header and the L4 header. Note also that if a device could compute the L4 checksum in hardware and store it in the L4 header, it would end up modifying the L3 payload, which is not possible when the latter has been digested or encrypted by the IPsec suite, because it would invalidate the data.

csum_tcpudp_magic
csum_tcpudp_magic

计算 TCP 和 UDP 伪标头的校验和(见图18-14)。

Compute the checksum on the TCP and UDP pseudoheader (see Figure 18-14).

较新的 NIC 可以在硬件中同时提供 IP 和 L4 校验和计算。虽然 Linux 利用了大多数现代 NIC 的 L4 硬件校验和功能,但它没有利用 IP 硬件校验和功能,因为这不值得额外的复杂性(即,考虑到 IP 标头的大小有限,软件计算已经足够快)。硬件校验和只是让内核更快处理数据包的 CPU 卸载的一个例子;大多数现代 NIC 也提供一些 L4(主要是 TCP)卸载。第 19 章简要介绍了硬件校验和。

Newer NICs can provide both the IP and L4 checksum computations in hardware. While Linux takes advantage of the L4 hardware checksumming capabilities of most modern NICs, it does not take advantage of the IP hardware checksumming capabilities because it's not worth the extra complexity (i.e., the software computation is already fast enough given the limited size of the IP header). Hardware checksumming is only one example of CPU offloading that allows the kernel to process packets faster; most modern NICs provide some L4 (mainly TCP) offloading, too. Hardware checksumming is briefly described in Chapter 19.

L4 校验和的更改

Changes to the L4 Checksum

TCP 和 UDP 协议计算一个校验和,其中包含它们的标头、有效负载以及所谓的 伪标头,伪标头基本上是一个块,为了方便起见,其字段取自 IP 标头(见图18-14)。换句话说,IP 标头中出现的一些信息最终会合并到 L4 校验和中。请注意,伪标头仅用于计算校验和;它不存在于线路上的数据包中。

The TCP and UDP protocols compute a checksum that covers their header, their payloads, and what is known as the pseudoheader, which is basically a block whose fields are taken from the IP header for convenience (see Figure 18-14). In other words, some information that appears in the IP header ends up being incorporated in the L4 checksum . Note that the pseudoheader is defined only for computing the checksum; it does not exist in the packet on the wire.

TCP 和 UDP 在计算校验和时使用的伪标头

图 18-14。TCP 和 UDP 在计算校验和时使用的伪标头

Figure 18-14. Pseudoheader used by TCP and UDP while computing the checksum
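The pseudoheader's contribution can be sketched in the spirit of csum_tcpudp_magic. This is an illustrative model, not the kernel function: the name `l4_checksum` is made up, all quantities are treated in host byte order for simplicity, and `partial` stands for the sum already accumulated over the L4 header and payload (what csum_partial would produce):

```c
#include <stdint.h>

/* Fold the pseudoheader (source address, destination address, protocol,
 * and L4 length, as in Figure 18-14) into a partial L4 checksum, then
 * fold the carries and complement, as csum_fold would. */
uint16_t l4_checksum(uint32_t saddr, uint32_t daddr,
                     uint8_t proto, uint16_t len, uint32_t partial)
{
    uint32_t sum = partial;

    sum += saddr >> 16;  sum += saddr & 0xFFFF;
    sum += daddr >> 16;  sum += daddr & 0xFFFF;
    sum += proto;        sum += len;

    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);   /* end-around carry */

    return (uint16_t)~sum;
}
```

The sketch makes the NAT problem described next concrete: rewriting the source address changes the result even though not a single L4 byte was touched, which is why the kernel must fix up the L4 checksum whenever it rewrites a pseudoheader field.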

不幸的是,IP 层有时需要更改 TCP 和 UDP 在其伪标头中使用的某些 IP 标头字段,以进行 NAT 或其他活动。IP 级别的更改使 L4 校验和无效。如果校验和保留在原位,则 IP 层的任何节点都不会检测到任何错误,因为它们仅验证 IP 校验和。然而,目标主机的 TCP 层会认为数据包已损坏。因此,这种情况必须由内核处理。

Unfortunately, the IP layer sometimes needs to change some of the IP header fields, for NAT or other activities, that were used by TCP and UDP in their pseudoheaders. The change at the IP level invalidates the L4 checksums. If the checksum is left in place, none of the nodes at the IP layer will detect any error because they validate only the IP checksum. However, the TCP layer of the destination host will believe the packet is corrupted. This case therefore has to be handled by the kernel.

此外,在某些常规情况下,硬件对接收帧计算的 L4 校验和会失效。以下是最常见的:

Furthermore, there are routine cases where L4 checksums computed in hardware on received frames are invalidated. Here are the most common ones:

  • 当输入 L2 帧包含一些为达到最小帧长而加入的填充,而 NIC 不够智能,在计算校验和时没有忽略这些填充。在这种情况下,硬件校验和将与接收端 L4 层计算的校验和不匹配。您将在第 19 章的"处理输入 IP 数据包"一节中看到,为了安全起见,ip_rcv 函数在这种情况下总是使校验和无效。在第四部分中,您将看到桥接代码也可以执行类似的操作。

  • When an input L2 frame includes some padding to reach the minimum frame size, but the NIC was not smart enough to leave the padding out when computing the checksum. In this case, the hardware checksum won't match the one computed by the receiving L4 layer. You will see in the section "Processing Input IP Packets" in Chapter 19 that to be on the safe side, the ip_rcv function always invalidates the checksum in this case. In Part IV, you will see that the bridging code can do something similar.

  • 当输入 IP 片段与先前接收到的片段重叠时。参见 第 22 章

  • When an input IP fragment overlaps with a previously received fragment. See Chapter 22.

  • 当输入 IP 数据包使用任何 IPsec 套件协议时。在这种情况下,NIC 无法正确计算 L4 校验和,因为 L4 标头和有效载荷已被压缩、摘要或加密。有关示例,请参阅 net/ipv4/esp4.c 中的 esp_input。

  • When an input IP packet uses any of the IPsec suite's protocols. In such cases, the L4 checksum cannot have been computed correctly by the NIC because the L4 header and payload are either compressed, digested, or encrypted. For an example, see esp_input in net/ipv4/esp4.c.

  • 由于 NAT 或 IP 层的某些类似干预,需要重新计算校验和。例如,请参见 net/ipv4/netfilter/ip_nat_standalone.c 中的 ip_nat_fn。

  • The checksum needs to be recomputed because of NAT or some similar intervention at the IP layer. See, for instance, ip_nat_fn in net/ipv4/netfilter/ip_nat_standalone.c.

尽管名称可能令人困惑,但 skb->ip_summed 字段与 L4 校验和有关(更多详细信息请参阅第 19 章)。当 IP 层知道某些内容(例如伪标头一部分的字段发生更改)使 L4 校验和无效时,它会操作该字段的值。

Although the name might prove confusing, the field skb->ip_summed has to do with the L4 checksum (more details in Chapter 19). Its value is manipulated by the IP layer when it knows that something has invalidated the L4 checksum, such as a change in a field that is part of the pseudoheader.

我不会详细介绍如何计算本地生成的数据包的校验和。但是我们将在第 21 章的“将数据复制到片段中:getfrag ”一节中简要了解如何在创建片段时增量计算它。

I will not cover the details of how the checksum is computed for locally generated packets. But we will briefly see in the section "Copying data into the fragments: getfrag" in Chapter 21 how it can be computed incrementally while creating fragments.




[ * ]用于处理多播流量的函数未包含在图 18-1中(除了ip_mc_output)。该图包括主要的API;然而,还有一些在特定情况下使用的其他方法。参见第 21 章

[*] The functions used to handle multicast traffic are not included in Figure 18-1 (apart from ip_mc_output). The figure includes the main APIs; however, there are others that are used in specific cases. See Chapter 21.

[ * ]第 2 章中,您可以找到最常见接口使用的 MTU 表格。

[*] In Chapter 2, you can find a table with the MTU used by the most common interfaces.

[ * ]您可以在 IETF 网站 http://www.ietf.org/html.charters/OLD/diffserv-charter.html上找到有关 diffserv 的更多信息。(由于某些原因,带有 OLD 关键字的 URL 比不带 OLD 关键字的 URL 更新。)

[*] You can find more information about diffserv on the IETF web site, http://www.ietf.org/html.charters/OLD/diffserv-charter.html. (For some reason, the URL with the OLD keyword is more up-to-date than the one without it.)

[ * ]默认值实际上取决于数据包是否是多播的。组播 IP 数据包的默认 TTL 为 1(可以通过 setsockopt系统调用更改)。

[*] The default value actually depends on whether the packet is multicast. Multicast IP packets have a default TTL of 1 (which can be changed with the setsockopt system call).

[ ]请注意,该文件不是内核的一部分,但包含在所有 Linux 发行版中。

[] Note that this file is not part of the kernel, but is included in all Linux distributions.

[ * ]更详细的列表,您可以参考http://www.iana.org/assignments/ip-parameters

[*] For a more detailed list, you can refer to http://www.iana.org/assignments/ip-parameters.

[ ]在第 19 章的"IP 选项"部分,我们将看到 ip_forward_options 如何使用 IPOPT_COPIED。

[] In the section "IP Options" in Chapter 19, we will see how ip_forward_options uses IPOPT_COPIED.

[ * ] (40-3)/4=9,其中 40 是 IP 选项的最大大小,3 是选项标头的大小,4 是 IPv4 地址的大小。

[*] (40-3)/4=9, where 40 is the maximum size of the IP options, 3 is the size of the options header, and 4 is the size of an IPv4 address.

[ * ] length 的值不是 4 的精确倍数,因为选项标头(type、length 和 pointer)的长度为 3 个字节。这意味着 32 位 IP 地址不方便地跨 32 位字边界进行分割。

[*] The value of length is not an exact multiple of 4 because the option header (type, length, and pointer) is 3 bytes long. This means that the 32-bit IP addresses are inconveniently split across 32-bit word boundaries.

[ * ] UTC 代表协调世界时(Coordinated Universal Time),也称为 GMT(格林威治标准时间)。

[*] UTC stands for Coordinated Universal Time, also called GMT (Greenwich Mean Time).

[ * ]第21章中的“ ip_append_data函数”部分展示了L3和L4之间的接口如何发展以优化本地生成的数据包的分段任务。

[*] The section "The ip_append_data Function" in Chapter 21 shows how the interface between L3 and L4 has evolved to optimize the fragmentation task for locally generated packets.

[ ]正如我们将在第 21 章的“组合传输功能”部分中看到的,L4 协议实际上提供了一些可以影响分段的选项。

[] As we will see in the section "Putting Together the Transmission Functions" in Chapter 21, L4 protocols actually provide some options that can influence fragmentation.

[ * ] Linux 所谓的伪装通常也称为端口地址转换 (PAT)。

[*] What Linux calls masquerading is also commonly called Port Address Translation (PAT).

[ ]请注意,由于来自 Internet 并寻址到内部网络中的主机的返回流量都将具有目标 IP 地址 140.105.1.1,因此 R 使用目标 UDP/TCP 端口号来查找要路由的正确内部主机的入口流量。对于我们的例子,我们不需要看这个端口业务是如何处理的。

[] Note that since the return traffic from the Internet and addressed to the hosts in the internal network will all have a destination IP address of 140.105.1.1, R uses the destination UDP/TCP port number to find the right internal host to route the ingress traffic to. We do not need to look at how this port business is handled for our example.

[ * ]更准确地说,路由缓存条目与多个参数的组合相关联,包括源 IP 地址、目标 IP 地址和 IP TOS。

[*] To be more exact, a routing cache entry is associated with a combination of several parameters, including the source IP address, the destination IP address, and the IP TOS.

[ ]我们将在第 31 章中看到,如果向内核添加对多路径路由的支持,则可以定义具有多个下一跳的路由,每个下一跳都可以通过不同的接口访问。

[] We will see in Chapter 31 that if you add support for multipath routing to the kernel, you can define routes with multiple next hops, each one of which can potentially be reachable with a different interface.

[ ]如果您对更多详细信息感兴趣,我建议您阅读 RFC 791、1191 和 2923。

[] If you are interested in more details, I suggest you read RFCs 791, 1191, and 2923.

[ * ]可以有多个路由到同一目的地,以实现冗余或负载平衡。

[*] There can be more than one route to the same destination, for redundancy or load balancing.

第 19 章 Internet 协议版本 4 (IPv4):Linux 基础和功能

Chapter 19. Internet Protocol Version 4 (IPv4): Linux Foundations and Features

前一章阐述了操作系统需要做什么来支持IP协议;本章介绍 Linux 支持 IP 的数据结构和基本活动,例如入口 IP 数据包如何传递到 IP 接收例程、如何验证校验和以及如何处理 IP 选项。

The previous chapter laid out what an operating system needs to do to support the IP protocol; this chapter introduces the data structures and basic activities through which Linux supports IP, such as how ingress IP packets are delivered to the IP reception routine, how the checksum is verified, and how IP options are processed.

主要 IPv4 数据结构

Main IPv4 Data Structures

本节介绍 IPv4 协议使用的主要数据结构。关于它们的字段的详细描述可以参考第23章。

This section introduces the major data structures used by the IPv4 protocol. You can refer to Chapter 23 for a detailed description of their fields.

我没有用图片来显示数据结构之间的关系,因为它们大多数都是独立的并且不保留交叉引用。

I have not included a picture to show the relationships among the data structures because most of them are independent and do not keep cross-references.

iphdr structure
iphdr structure

IP 标头。其字段的含义已在第18 章的“ IP 标头”部分中介绍过。

IP header. The meaning of its fields has already been covered in the section "IP Header" in Chapter 18.

ip_options structure
ip_options structure

该结构在include/linux/ip.h中定义,表示需要传输或转发的数据包的选项。选项存储在此结构中,因为它比 IP 标头本身的相应部分更容易读取。

This structure, defined in include/linux/ip.h, represents the options for a packet that needs to be transmitted or forwarded. The options are stored in this structure because it is easier to read than the corresponding portion of the IP header itself.

ipcm_cookie structure
ipcm_cookie structure

该结构结合了传输数据包所需的各种信息。

This structure combines various pieces of information needed to transmit a packet.

ipq structure
ipq structure

IP 数据包片段的集合。请参阅第 22 章中的“ IP 片段哈希表的组织”部分。

Collection of fragments of an IP packet. See the section "Organization of the IP Fragments Hash Table" in Chapter 22.

inet_peer structure
inet_peer structure

内核为最近与之通信的每个远程主机保留一个该结构的实例。在第 23 章的"长期 IP 对等信息"部分中,您将看到它是如何使用的。所有 inet_peer 结构的实例都保存在 AVL 树中,这是一种针对频繁查找而优化的结构。

The kernel keeps an instance of this structure for each remote host it has been talking to in the recent past. In the section "Long-Living IP Peer Information" in Chapter 23 you will see how it is used. All instances of inet_peer structures are kept in an AVL tree, a structure optimized for frequent lookups.

ipstats_mib structure
ipstats_mib structure

简单网络管理协议 (SNMP) 使用一种称为管理信息库 (MIB) 的对象来收集有关系统的统计信息。一个称为 ipstats_mib 的数据结构保存有关 IP 层的统计信息。第 23 章中的"IP 统计"部分更详细地介绍了该结构。

The Simple Network Management Protocol (SNMP) employs a type of object called a Management Information Base (MIB) to collect statistics about systems. A data structure called ipstats_mib keeps statistics about the IP layer. The section "IP Statistics" in Chapter 23 covers this structure in more detail.

in_device structure
in_device structure

in_device 结构存储网络设备的所有 IPv4 相关配置,例如用户使用 ifconfig 或 ip 命令所做的更改。该结构通过 net_device->ip_ptr 链接到 net_device 结构,并且可以使用 in_dev_get 和 _ _in_dev_get 进行检索。这两个函数之间的区别在于,第一个函数负责所有必要的锁定,而第二个函数假设调用者已经处理了锁定。

由于 in_dev_get 在成功时(即,当设备配置为支持 IPv4 时)会在内部增加 in_dev 结构上的引用计数,因此其调用者在用完该结构后应使用 in_dev_put 减少引用计数。

该结构由 inetdev_init 分配并链接到设备;当在设备上配置第一个 IPv4 地址时会调用该函数。

The in_device structure stores all the IPv4-related configuration for a network device, such as changes made by a user with the ifconfig or ip command. This structure is linked to the net_device structure via net_device->ip_ptr and can be retrieved with in_dev_get and _ _in_dev_get. The difference between those two functions is that the first one takes care of all the necessary locking, and the second one assumes the caller has taken care of it already.

Since in_dev_get internally increases a reference count on the in_dev structure when it succeeds (i.e., when a device is configured to support IPv4), its caller is supposed to decrement the reference count with in_dev_put when it is done with the structure.

The structure is allocated and linked to the device with inetdev_init, which is called when the first IPv4 address is configured on the device.

in_ifaddr structure
in_ifaddr structure

在接口上配置 IPv4 地址时,内核会创建一个in_ifaddr包含 4 字节地址以及其他几个字段的结构。

When configuring an IPv4 address on an interface, the kernel creates an in_ifaddr structure that includes the 4-byte address along with several other fields.

ipv4_devconf structure
ipv4_devconf structure

ipv4_devconf 数据结构的字段通过 /proc 在 /proc/sys/net/ipv4/conf/ 中导出,用于调整网络设备的行为。每个设备都有一个实例,此外还有一个存储默认值的实例(ipv4_devconf_dflt)。第 28 章和第 36 章介绍了其字段的含义。

The ipv4_devconf data structure, whose fields are exported via /proc in /proc/sys/net/ipv4/conf/, is used to tune the behavior of a network device. There is an instance for each device, plus one that stores the default values (ipv4_devconf_dflt). The meanings of its fields are covered in Chapters 28 and 36.

ipv4_config structure
ipv4_config structure

虽然ipv4_devconf结构用于存储每个设备的配置,但ipv4_config 存储适用于主机的配置。

While ipv4_devconf structures are used to store per-device configuration, ipv4_config stores configuration that applies to the host.

cork
cork

cork 结构体用于处理套接字的 CORK 选项。我们将在第 21 章中看到,在连续调用 ip_append_data 和 ip_append_page 处理数据分片时,如何使用它的字段来维护一些上下文信息。

The cork structure is used to handle the socket CORK option. We will see in Chapter 21 how its fields are used to maintain some context information across consecutive invocations of ip_append_data and ip_append_page to handle data fragmentation.

sk_buff 和 net_device 结构中与校验和相关的字段

Checksum-Related Fields from sk_buff and net_device Structures

在第 18 章的"校验和"部分中,我们看到了用于计算 IP 和 L4 校验和的例程。在本节中,我们将了解 sk_buff 缓冲区结构的哪些字段用于存储有关校验和的信息,设备如何向内核告知其硬件校验和功能,以及 L4 协议如何使用这些信息来决定是自行计算入口和出口数据包的校验和,还是让网络接口卡 (NIC) 执行此操作。

We saw the routines used to compute the IP and L4 checksums in the section "Checksums" in Chapter 18. In this section, we will see what fields of the sk_buff buffer structure are used to store information about checksums, how devices tell the kernel about their hardware checksumming capabilities, and how the L4 protocols use such information to decide whether to compute the checksum for ingress and egress packets or to let the network interface cards (NICs) do it.

由于 IP 校验和始终由内核在软件中计算和验证,因此接下来的小节将重点介绍 L4 校验和处理和问题。

Because the IP checksum is always computed and verified in software by the kernel, the next subsections concentrate on L4 checksum handling and issues.

网络设备结构

net_device structure

net_device->features 字段指定设备的功能。在可以设置的各种标志中,有一些用于定义设备的硬件校验和功能。可能的功能列表位于 include/linux/netdevice.h 中 net_device 自身的定义内。以下是用于控制校验和的标志:

The net_device->features field specifies the capabilities of the device. Among the various flags that can be set, a few are used to define the device's hardware checksumming capabilities. The list of possible features is in include/linux/netdevice.h inside the definition of net_device itself. Here are the flags used to control checksumming:

NETIF_F_NO_CSUM
NETIF_F_NO_CSUM

该设备非常可靠,无需使用任何 L4 校验和。例如,在环回设备上启用此功能。

The device is so reliable that there is no need to use any L4 checksum. This feature is enabled, for instance, on the loopback device.

NETIF_F_IP_CSUM
NETIF_F_IP_CSUM

该设备可以在硬件中计算 L4 校验和,但仅限于 IPv4 上的 TCP 和 UDP。

The device can compute the L4 checksum in hardware, but only for TCP and UDP over IPv4.

NETIF_F_HW_CSUM
NETIF_F_HW_CSUM

该设备可以在硬件中计算任何协议的 L4 校验和。此功能不如 NETIF_F_IP_CSUM 常见。

The device can compute the L4 checksum in hardware for any protocol. This feature is less common than NETIF_F_IP_CSUM.

sk_buff结构

sk_buff structure

skb->csum 和 skb->ip_summed 这两个字段具有不同的含义,具体取决于 skb 是指向接收到的数据包还是指向要发送出去的数据包。

The two fields skb->csum and skb->ip_summed have different meanings depending on whether skb points to a received packet or to a packet to be transmitted out.

当接收到数据包时,skb->csum 可能保存其 L4 校验和。命名奇怪的 skb->ip_summed 字段跟踪 L4 校验和的状态。状态由以下值指示,它们在 include/linux/skbuff.h 中定义。以下定义表示设备驱动程序告诉 L4 层的内容。一旦 L4 接收例程收到缓冲区,它可能会更改 skb->ip_summed 的初始值。

When a packet is received, skb->csum may hold its L4 checksum. The oddly named skb->ip_summed field keeps track of the status of the L4 checksum. The status is indicated by the following values, defined in include/linux/skbuff.h. The following definitions represent what the device driver tells the L4 layer. Once the L4 receive routine receives the buffers, it may change the initialization of skb->ip_summed.

CHECKSUM_NONE
CHECKSUM_NONE

csum 中的校验和无效。这可能有多种原因:

  • 该设备不提供硬件校验和。

  • 设备计算了硬件校验和并发现帧已损坏。此时,设备驱动程序可以直接丢弃该帧。但某些设备驱动程序更喜欢将 ip_summed 设置为 CHECKSUM_NONE,并让软件再次计算和验证校验和。这是不幸的,因为在接收数据包的所有开销之后,内核所做的只是重新检查校验和并丢弃数据包(请参阅 drivers/net/e1000/e1000_main.c 中的 e1000_rx_checksum)。请注意,如果要转发输入帧,路由器不应由于错误的 L4 校验和而丢弃它(路由器不应该查看 L4 校验和),这将由目标主机来完成。这是设备驱动程序不丢弃未通过 L4 校验和的帧,而是让 L4 接收例程验证它们的另一个原因。

  • 校验和需要重新计算和重新验证。 有关最常见的原因,请参阅第 18 章中的“ L4 校验和的更改”部分。

The checksum in csum is not valid. This can be due to various reasons:

  • The device does not provide hardware checksumming.

  • The device computed the hardware checksums and found the frame to be corrupted. At this point, the device driver could discard the frame directly. But some device drivers prefer to set ip_summed to CHECKSUM_NONE and let the software compute and verify the checksum again. This is unfortunate, because after all of the overhead of receiving the packet, all that the kernel does is recheck the checksum and discard the packet (see e1000_rx_checksum in drivers/net/e1000/e1000_main.c). Note that if the input frame is to be forwarded, the router should not discard it due to a wrong L4 checksum (a router is not supposed to look at the L4 checksum). It will be up to the destination host to do it. This is another reason why device drivers do not discard frames that fail the L4 checksum, but let the L4 receive routine verify them.

  • The checksum needs to be recomputed and reverified. See the section "Changes to the L4 Checksum" in Chapter 18 for the most common reasons.

CHECKSUM_HW
CHECKSUM_HW

NIC 计算了 L4 标头和有效负载的校验和,并将其复制到 skb->csum 字段中。软件(即 L4 接收例程)只需将伪报头上的校验和加到 skb->csum 中,并验证所得到的校验和。该标志可以被认为是下一个标志的特殊情况。

The NIC has computed the checksum on the L4 header and payload and has copied it into the skb->csum field. The software (i.e., the L4 receive routine) needs only to add the checksum on the pseudoheader to skb->csum and to verify the resulting checksum. This flag can be considered a special case of the following flag.

CHECKSUM_UNNECESSARY
CHECKSUM_UNNECESSARY

NIC 已经计算并验证了 L4 报头和有效负载上的校验和,以及伪报头上的校验和(伪报头上的校验和也可以由设备驱动程序在软件中计算),因此软件无需执行任何 L4 校验和验证。

The NIC has computed and verified the checksum on the L4 header and payload, as well as on the pseudoheader (the checksum on the pseudoheader may optionally be computed by the device driver in software), so the software is relieved from having to do any L4 checksum verification.

CHECKSUM_UNNECESSARY也可以设置,例如,当错误的概率非常低并且计算和验证 L4 校验和会浪费时间和 CPU 能力时。一个例子是环回设备:由于通过该虚拟设备发送的数据包永远不会离开本地主机,因此唯一可能的错误是由于 RAM 故障或操作系统中的错误造成的。因此,此选项可以与此类特殊设备一起使用,但标准行为是计算每个接收到的数据包的校验和并在接收端丢弃损坏的数据包。

CHECKSUM_UNNECESSARY can also be set, for example, when the probability of an error is very low and it would be a waste of time and CPU power to compute and verify the L4 checksum. One example is the loopback device: since the packets sent through this virtual device never leave the local host, the only possible errors would be due to faulty RAM or bugs in the operating system. This option can therefore be used with such special devices, but the standard behavior is to compute the checksum of each received packet and discard corrupted packets at the receiving end.

传输数据包时,csum 表示指向缓冲区内某个位置的指针(或更准确地说,是偏移量),硬件卡必须在该位置放置它将要计算的校验和,而不是校验和本身。因此,仅当校验和是在硬件中计算时,才在数据包传输期间使用该字段。这种绕过 L3 的 L4 与 L2 之间的交互引入了一些需要处理的额外问题。例如,诸如网络地址转换 (NAT) 之类的功能会操纵 L4 层用来计算伪标头校验和的 IP 标头字段,从而使该数据结构无效(请参阅第 18 章中的"L4 校验和的更改"部分)。

When a packet is transmitted, csum represents a pointer (or more accurately, an offset) to the place inside the buffer where the hardware card has to put the checksum it will compute, not the checksum itself. This field is therefore used during packet transmission only if the checksum is calculated in hardware. This interaction between L4 and L2, bypassing L3, introduces a couple of additional problems to deal with. For example, a feature such as Network Address Translation (NAT) that manipulates the fields of the IP header used by the L4 layer to compute the so-called checksum on the pseudoheader would invalidate that data structure (see the section "Changes to the L4 Checksum" in Chapter 18).

与接收情况一样,ip_summed 表示 L4 校验和的状态。L4 协议使用该字段告诉设备是否需要处理校验和。特别是,以下是传输期间 ip_summed 各取值的含义:

As in the case of reception, ip_summed represents the status of the L4 checksum. The field is used by the L4 protocols to tell the device whether it needs to take care of checksumming. In particular, this is the meaning of ip_summed during transmissions:

CHECKSUM_NONE
CHECKSUM_NONE

协议已经处理好校验和;该设备不需要执行任何操作。当您转发入口帧时,L4 校验和已经准备就绪,因为它已由发送方主机计算出来;因此,无需计算它。参见第 20 章中的 ip_forward。当 ip_summed 设置为 CHECKSUM_NONE 时,csum 没有意义。

The protocol has already taken care of the checksum; the device does not need to do anything. When you forward an ingress frame, the L4 checksum is already ready because it has been computed by the sender host; therefore, there is no need to compute it. See ip_forward in Chapter 20. When ip_summed is set to CHECKSUM_NONE, csum is meaningless.

CHECKSUM_HW
CHECKSUM_HW

协议仅将伪标头的校验和存储到其标头中;设备应该通过在 L4 标头和负载上添加校验和来完成此操作。

The protocol has stored into its header the checksum on the pseudoheader only; the device is supposed to complete it by adding the checksum on the L4 header and payload.

传输数据包时,ip_summed 不使用 CHECKSUM_UNNECESSARY 值(它相当于 CHECKSUM_NONE)。

ip_summed does not use the CHECKSUM_UNNECESSARY value when transmitting packets (it would be equivalent to CHECKSUM_NONE).

虽然 NETIF_F_ XXX _CSUM 功能标志在 NIC 启用时由设备驱动程序初始化,但 CHECKSUM_ XXX 标志必须为接收或传输的每个 sk_buff 缓冲区设置。在接收时,由设备驱动程序根据 NETIF_F_ XXX _CSUM 设备功能正确初始化 ip_summed。

While the feature flags NETIF_F_ XXX _CSUM are initialized by the device driver when the NIC is enabled, the CHECKSUM_ XXX flags have to be set for every sk_buff buffer that is received or transmitted. At reception time, it is the device driver that initializes ip_summed correctly based on the NETIF_F_ XXX _CSUM device capabilities.

在传输时,L3 传输 API 根据出口设备的校验和功能初始化 ip_summed,该功能可以从路由表中得到:与目的地匹配的路由表缓存条目包含有关出口设备的信息,因此也包含其校验和功能(示例请参阅 ip_append_data)。

At transmission time, the L3 transmission APIs initialize ip_summed based on the checksumming capabilities of the egress device, which can be derived from the routing table: the routing table cache entry that matches the destination includes information about the egress device, and therefore its checksumming capabilities (see ip_append_data for an example).

考虑到前面描述的 skb->csum 和 skb->ip_summed 字段以及 CHECKSUM_HW 标志的含义,您可以研究(例如)TCPv4 如何在 tcp_v4_checksum_init 中处理入口段的校验和,以及在 tcp_v4_send_check 中处理出口段的校验和。

Given the meaning of the skb->csum and skb->ip_summed fields and the CHECKSUM_HW flag previously described, you can study, for example, how TCPv4 takes care of the checksum on ingress segments in tcp_v4_checksum_init, and the checksum of egress segments in tcp_v4_send_check.

一般数据包处理

General Packet Handling

本章的其余部分介绍了内核在处理入口 IP 数据包时必须考虑的一些一般注意事项,例如校验和和选项。后续章节将详细介绍它们如何转发、传输以及分片/碎片整理。

The rest of this chapter covers some general considerations that the kernel has to take into account when handling ingress IP packets, such as checksumming and options. Subsequent chapters go into detail about how they are forwarded, transmitted, and fragmented/defragmented.

协议初始化

Protocol Initialization

IPv4 协议由 ip_init 初始化,该函数在 net/ipv4/ip_output.c 中定义。由于无法从内核中删除 IPv4 支持(即无法将其编译为模块),因此没有 ip_uninit 函数。

The IPv4 protocol is initialized by ip_init, defined in net/ipv4/ip_output.c. Because IPv4 support cannot be removed from the kernel (i.e., it cannot be compiled as a module), there is no ip_uninit function.

以下是 ip_init 完成的主要任务:

Here are the main tasks accomplished by ip_init:

ip_init 在启动时由 inet_init 调用,后者负责所有与 IPv4 相关的子系统(包括 L4 协议)的初始化。

ip_init is invoked at boot time by inet_init, which takes care of the initialization of all the subsystems related to IPv4, including the L4 protocols.

与 Netfilter 交互

Interaction with Netfilter

我们不会在本书中研究 Netfilter 防火墙子系统,但我们现在可以研究它的主要工作原理,特别是它与我们在本书这一部分讨论的 IPv4 实现方面的关系。

We will not examine the Netfilter firewalling subsystem in this book, but we can examine its main working principles now, particularly its relationship to the aspects of the IPv4 implementation we discuss in this part of the book.

本质上,防火墙挂接在网络堆栈代码中数据包总会经过的某些位置;当数据包或内核满足某些条件时,防火墙允许网络管理员在这些点上操纵流量的内容或处置。内核中的这些点(如第 18 章中的图 18-1 所示)包括:

Firewalling, essentially, hooks into certain places in the network stack code that packets always pass through when the packets or the kernel meet certain conditions; at those points, the firewall allows network administrators to manipulate the contents or disposition of the traffic. Those points in the kernel, as shown in Figure 18-1 in Chapter 18, include:

  • 数据包接收

  • Packet reception

  • 数据包转发(路由决策之前)

  • Packet forwarding (before routing decision)

  • 数据包转发(路由决策后)

  • Packet forwarding (after routing decision)

  • 数据包传输

  • Packet transmission

区分路由前(pre-routing)和路由后(post-routing)处理点为何有用,将在第七部分中变得更加清楚。

The reason why it is useful to distinguish between pre-routing and post-routing will become clearer in Part VII.

在刚刚列出的每种情况下,负责该操作的函数都分为两部分,通常称为 do_something 和 do_something _finish。(在少数情况下,这两个名称是 do_something 和 do_something 2。)do_something 仅包含一些健全性检查,也许还有一些内务处理;真正完成工作的代码位于 do_something _finish 或 do_something 2 中。do_something 最后调用 Netfilter 函数 NF_HOOK,传入调用的来源点(例如,数据包接收),以及当用户用 iptables 命令配置的过滤规则未决定丢弃或拒绝数据包时要执行的函数。如果没有可应用的规则,或者它们只是指示"继续",则执行 do_something _finish 函数。给定以下一般调用:

In each case just listed, the function in charge of the operation is split into two parts, usually called do_something and do_something _finish. (In a few cases, the names are do_something and do_something 2.) do_something contains only some sanity checks and maybe some housekeeping. The code that does the real job is in do_something _finish or do_something 2. do_something ends by calling the Netfilter function NF_HOOK, passing in the point where the call comes from (for instance, packet reception) and the function to execute if the filtering rules configured by the user with the iptables command do not decide to drop or reject the packet. If there are no rules to apply or they simply indicate "go ahead," the function do_something _finish is executed. Given the following general call:

NF_HOOK(PROTOCOL, HOOK_POSITION_IN_THE_STACK, SKB_BUFFER, IN_DEVICE, OUT_DEVICE, do_
something_finish)

NF_HOOK 的输出值可以是以下之一:

the output value of NF_HOOK can be one of the following:

  • 当 do_something _finish 被执行时,它的返回值

  • The output value of do_something _finish when the latter is executed

  • 如果 SKB_BUFFER 由于过滤器而被丢弃,则为 -EPERM

  • -EPERM if SKB_BUFFER is dropped because of a filter

  • 如果没有足够的内存来执行过滤操作,则为 -ENOMEM

  • -ENOMEM if there was insufficient memory to perform the filtering operation

在本章中,我们不需要担心这些细节。我们假设没有配置过滤器,因此在 do_something 结束时,对 Netfilter 函数的调用将简单地执行 do_something _finish。我们将在 ip_rcv 函数的末尾看到第一个示例。

In this chapter, we do not need to worry about those details. We will assume that no filters are configured and therefore that, at the end of do_something, the call to the Netfilter function will simply execute do_something _finish. We will see the first example at the end of the ip_rcv function.

与路由子系统的交互

Interaction with the Routing Subsystem

IP层需要在多个地方与路由表进行交互,例如在接收和发送数据包时。我将在第七部分描述路由子系统时介绍有关路由的详细信息;现在我将简要描述 IP 层用来查询路由表的三个函数:

The IP layer needs to interact with the routing table in several places, such as when receiving and when transmitting a packet. I will cover the details about routing in Part VII, where I describe the routing subsystem; for now I'll just briefly describe three of the functions used by the IP layer to consult the routing table:

ip_route_input
ip_route_input

确定输入数据包的命运。正如第 18 章中的图 18-1所示,数据包可以在本地传送、转发或丢弃。

Determines the destiny of an input packet. As you can see in Figure 18-1 in Chapter 18, the packet could be delivered locally, forwarded, or dropped.

ip_route_output_flow
ip_route_output_flow

在传输数据包之前使用。此函数返回要使用的下一跳网关和出口设备。

Used before transmitting a packet. This function returns both the next hop gateway and the egress device to use.

dst_pmtu
dst_pmtu

给定路由表缓存条目,返回关联的路径最大传输单元 (PMTU)。

Given a routing table cache entry, returns the associated Path Maximum Transmission Unit (PMTU).

第 33 章和第 35 章详细描述了 ip_route_ xxx 函数,它们查阅路由表并根据一组字段做出决策:

The ip_route_ xxx functions, described in detail in Chapters 33 and 35, consult the routing table and base their decisions on a set of fields:

  • 目的IP地址。

  • Destination IP address.

  • 源IP地址。

  • Source IP address.

  • 服务类型 (ToS)。

  • Type of Service (ToS).

  • 接收情况下的接收设备。

  • Receiving device in the case of reception.

  • 允许的传输设备列表。

  • List of allowed transmitting devices.

可能影响这些功能返回的决策的更复杂的因素包括策略路由的存在和防火墙的存在。

Among the more complex factors that could influence the decision returned by these functions are the presence of policy routing and the presence of a firewall.

这些函数将路由表查找的结果存储在 skb->dst 中。该结构包括几个字段,其中有 input 和 output 函数指针,它们将被调用以完成数据包的接收或发送(这两个函数指针的使用位置请参见第 18 章中的图 18-1)。如果查找失败,ip_route_ xxx 函数将返回负值。

The functions store the result of the routing table lookup in skb->dst. This structure includes several fields, including the input and output function pointers that will be called to complete the reception or the transmission of the packet (see Figure 18-1 in Chapter 18 for where those two function pointers are used). The ip_route_ xxx functions return a negative value if the lookup fails.

这两个函数还使用缓存来快速处理发往同一目的地的数据包流。目的 IP 地址是做出决定的最重要标准,并用作缓存中的搜索关键字。但每个缓存条目还包括几个其他参数,用于区分使用哪条路由。例如,缓存会跟踪每条路由的 PMTU,这在第 18 章的"路径 MTU 发现"部分中进行了描述。

Both functions also use a cache to get a stream of packets to the same destination quickly. The destination IP address is the most important criterion for making the decision, and is used as the search key into the cache. But each cache entry also includes several other parameters that distinguish which route is used. For instance, the cache keeps track of each route's PMTU, which was described in the section "Path MTU Discovery" in Chapter 18.

处理输入 IP 数据包

Processing Input IP Packets

第 13 章展示了内核通过调用各协议注册的处理函数,将每个级别的流量路由到正确的协议。在该章的"协议处理程序注册"部分中,我们了解了 IP 协议如何向内核注册其协议处理程序 ip_rcv(在 net/ipv4/ip_input.c 中定义)。现在我们可以从 ip_rcv 函数开始,分析 IP 数据包在内核网络堆栈内的路径。

Chapter 13 showed that the kernel routes traffic at every level to the proper protocol by invoking the handler function registered by that protocol. In the section "Protocol Handler Registration" in that chapter, we saw how the IP protocol registers its protocol handler ip_rcv, defined in net/ipv4/ip_input.c, with the kernel. We can now start to analyze the path of IP packets inside the kernel network stack, starting with the ip_rcv function.

ip_rcv 是"与 Netfilter 交互"部分中描述的两阶段过程的经典案例。它的工作仅包括对数据包进行健全性检查,然后调用 Netfilter 挂钩。大多数处理将在 ip_rcv_finish 中进行,它由 Netfilter 挂钩调用。

ip_rcv is a classic case of the two-stage process described in the section "Interaction with Netfilter." Its work consists just of applying sanity checks to the packet and then invoking the Netfilter hook. Most processing will take place in ip_rcv_finish, called from the Netfilter hook.

这是 ip_rcv 的原型。第三个输入参数未被使用。

Here is the prototype of ip_rcv. The third input parameter is not used.

int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt)

netif_receive_skb 函数(第 10 章中描述)将指向 L3 协议头的指针(skb->nh)设置在 L2 标头的末尾。因此,IP 层函数可以安全地将其转换为 iphdr 结构。

The netif_receive_skb function (described in Chapter 10) sets the pointer to the L3 protocol (skb->nh) at the end of the L2 header. IP layer functions can therefore safely cast it to an iphdr structure.

如前几章所述,在从 NIC 发出中断通知到调用 L3 协议处理程序的这一系列事件期间,sk_buff 的大部分字段都在调用 ip_rcv 之前设置好了。图 19-1 显示了 ip_rcv 开始时一些 sk_buff 字段的值。注意 skb->data 通常用于指向有效负载,这里却指向 L3 标头。

Most of the fields of sk_buff are set before the call to ip_rcv, as explained in previous chapters, during the sequence of events that take place from the interrupt notification by an NIC to the invocation of the L3 protocol handler. Figure 19-1 shows the values of some of the sk_buff fields when ip_rcv starts. Note that skb->data, which is usually used to point to the payload, here points to the L3 header.


图 19-1。ip_rcv开头的sk_buff数据结构的一部分

Figure 19-1. Part of sk_buff data structure at the beginning of ip_rcv

第10章和第13章中我们看到了NIC的设备驱动程序如何设置L3协议标识符和skb->protocol数据包类型skb->pkt_type。例如,以太网驱动程序通过该eth_type_trans 函数来实现这一点。

In Chapter 10 and Chapter 13 we saw how the NIC's device driver sets the L3 protocol identifier skb->protocol and the packet type skb->pkt_type. Ethernet drivers, for instance, do that by means of the eth_type_trans function.

当帧的 L2 目标地址与接收接口的地址不同时,skb->pkt_type 被设置为 PACKET_OTHERHOST。通常这些数据包会被 NIC 本身丢弃。但是,如果接口已进入混杂模式,它会接收所有数据包(无论目标 L2 地址如何),并将它们传递到更高层。内核会调用请求访问所有数据包的嗅探器,如第 10 章所述。但 ip_rcv 不关心发往其他地址的数据包,会简单地丢弃它们:

skb->pkt_type is set to PACKET_OTHERHOST when the L2 destination address of the frame is different from the address of the receiving interface. Normally those packets are discarded by the NIC itself. However, if the interface has been put into promiscuous mode, it receives all packets regardless of the destination L2 address and passes them up to higher layers. The kernel invokes sniffers that have requested access to all packets, as described in Chapter 10. But ip_rcv is not concerned with packets for other addresses and simply drops them:

    if (skb->pkt_type == PACKET_OTHERHOST)
        goto drop;

请注意,接收不同 L2 地址的数据包与接收应路由到另一个系统的数据包不同。在后一种情况下,数据包具有接口的 L2 地址,但具有与当前接收者的 L3 层地址不同的 L3 层地址。路由器被配置为接受此类数据包并路由它们,如第七部分中所述。

Note that receiving a packet for a different L2 address is not the same as receiving a packet that should be routed to another system. In the latter case, the packet has the interface's L2 address but an L3 layer address that is different from that of the current recipient. A router is configured to accept such packets and route them, as described in Part VII.

skb_share_check 检查数据包的引用计数是否大于 1,大于 1 意味着内核的其他部分持有对该缓冲区的引用。正如前面章节中所讨论的,嗅探器和其他用户可能对数据包感兴趣,因此每个数据包都包含一个引用计数。netif_receive_skb 函数(即调用 ip_rcv 的函数)在调用协议处理程序之前会增加引用计数。如果处理程序发现引用计数大于 1,它会创建自己的缓冲区副本,以便可以修改数据包;任何后续处理程序都将收到原始的、未更改的缓冲区。如果需要副本但内存分配失败,则数据包将被丢弃。

skb_share_check checks whether the reference count of the packet is bigger than 1, which means that other parts of the kernel have references to the buffer. As discussed in earlier chapters, sniffers and other users might be interested in packets, so each packet contains a reference count. The netif_receive_skb function, which is the one that calls ip_rcv, increments the reference count before it calls a protocol handler. If the handler sees a reference count bigger than 1, it creates its own copy of the buffer so that it can modify the packet. Any following handlers will receive the original, unchanged buffer. If a copy is needed but memory allocation fails, the packet is dropped.

    if ((skb = skb_share_check(skb, GFP_ATOMIC)) == NULL) {
            IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS);
            goto out;
    }
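The clone-when-shared logic can be sketched in user space as follows. `toy_skb` and `share_check` are illustrative stand-ins for this sketch, not kernel structures or APIs:

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the few sk_buff fields that matter here. */
struct toy_skb {
    int users;                 /* reference count */
    unsigned char data[64];
};

/* Sketch of the skb_share_check logic: if someone else still holds a
 * reference to the buffer, make a private copy so the handler can
 * modify the packet; a failed allocation means the caller drops it. */
static struct toy_skb *share_check(struct toy_skb *skb)
{
    if (skb->users > 1) {
        struct toy_skb *copy = malloc(sizeof(*copy));
        if (!copy)
            return NULL;       /* caller drops the packet */
        memcpy(copy->data, skb->data, sizeof(copy->data));
        copy->users = 1;
        skb->users--;          /* release our reference to the original */
        return copy;
    }
    return skb;                /* sole owner: no copy needed */
}
```

The same pattern (copy only when shared) appears again later in the chapter, in `skb_cow`.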

The job of pskb_may_pull is to make sure that the area pointed to by skb->data contains a block of data at least as big as the IP header, since each IP packet (fragments included) must include a complete IP header. If the condition is met, there is nothing to do. Otherwise, the missing part is copied from the data fragments (if any) stored in skb_shinfo(skb)->frags[]. [*] If this fails, the function terminates with an error. If it succeeds, the function must initialize iph again because pskb_may_pull could change the buffer structure.

    if (!pskb_may_pull(skb, sizeof(struct iphdr)))
        goto inhdr_error;
    iph = skb->nh.iph;

Next come some sanity checks on the IP header. The size of a basic IP header is 20 bytes, and since the size stored in the header is expressed in multiples of 32 bits (4 bytes), if its value is smaller than 5 it means there is an error. The second check in the if statement is rather fussy. Currently there are two versions of the IP protocol: IPv4 and IPv6. The if statement makes sure the packet is an IPv4 packet. But because the two protocols are handled by two different functions, the ip_rcv function should never have been called for IPv6 in the first place.

    if (iph->ihl < 5 || iph->version != 4)
        goto inhdr_error;
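On the wire, both fields checked here live in the first byte of the header: the version in the high nibble and the header length in the low nibble. A minimal user-space sketch of the same sanity check (the helper name is invented; in the kernel the fields are bitfields of struct iphdr):

```c
#include <stdint.h>

/* Mimics the ip_rcv sanity check on the first header byte: the
 * version occupies the high nibble, the header length (in 32-bit
 * words) the low nibble. A basic 20-byte header means ihl == 5. */
static int ipv4_first_byte_ok(uint8_t b)
{
    uint8_t version = b >> 4;
    uint8_t ihl     = b & 0x0F;
    return version == 4 && ihl >= 5;
}
```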

Now we repeat the same check as before, but this time we use the full IP header size (including the options). If the IP header claims a size of iph->ihl, the packet should be at least as long as iph->ihl. This check was left until now because the function needs first to make sure the basic header (i.e., the header without options) has not been truncated and that it passes a basic sanity check before reading something from it (ihl in this case).

    if (!pskb_may_pull(skb, iph->ihl*4))
        goto inhdr_error;
    iph = skb->nh.iph;

After these two protocol consistency checks have been performed, the function needs to compute the checksum and see whether it matches the one carried in the header. If it doesn't, the packet is dropped. The ip_fast_csum routine was introduced in the section "APIs for Checksum Computation" in Chapter 18.

    if (ip_fast_csum((u8 *)iph, iph->ihl) != 0)
        goto inhdr_error;
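The check relies on a property of the one's-complement sum: over a header whose checksum field is correct, the sum folds to 0xFFFF, so its complement is 0. A portable user-space sketch of the verification (not the architecture-optimized ip_fast_csum; the function name is invented):

```c
#include <stdint.h>
#include <stddef.h>

/* Sum the header as big-endian 16-bit words in one's-complement
 * arithmetic and return the complement of the folded sum. For a
 * header with a valid checksum field the result is 0. */
static uint16_t ip_checksum_fold(const uint8_t *hdr, unsigned int ihl)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < (size_t)ihl * 4; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)                 /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```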

After the checksum, there are two other sanity checks:

  • Make sure the length of the buffer (i.e., the received packet) is greater than or equal to the length reported in the IP header.

  • Make sure the size of the packet is at least as large as the IP header size.

           {
            _ _u32 len = ntohs(iph->tot_len);
            if (skb->len < len || len < (iph->ihl<<2))
                goto inhdr_error;

Here we need to explain why those two checks are needed. The first one arises from the fact that the L2 protocols (e.g., Ethernet) can pad out the payload,[*] so there may be extra bytes after the IP payload. (This happens, for instance, when the L2 size of the frame is smaller than the minimum required by the protocol. Ethernet frames have a minimum frame length of 64 bytes.) In such a case, the packet would look bigger than the length reported in the IP header. The different sizes and padding are shown in Figure 19-2.

Figure 19-2. L2 padding needed to reach the minimum payload size

The second check derives from the fact that an IP header cannot be fragmented, and that each IP fragment must therefore contain at least an IP header.[*] The reason for the <<2 in the condition is that the size of the header (iph->ihl) is measured in units of 32 bits. This check should fail only in an extremely rare situation. It would mean that the checksum had been computed on a corrupted packet but happened by chance to produce the same checksum as the original packet (i.e., the checksum did not detect the error).

The minimum MTU associated with a route is in fact 68, which comes from RFC 791. Since the IP header can be up to 60 bytes long (20+40) and the minimum fragment length (with the exception of the last one) is 8 bytes, it follows that every IP router must be able to forward an IP packet of 68 bytes without any further fragmentation.

As you can imagine, all of the sanity checks that we have seen so far and that we will see later are very important for the stability of the system. If, by chance, the sk_buff structure was incorrectly initialized, or if the IP header itself was corrupted, the kernel could process packets in a wrong way or could access invalid memory locations, which could indirectly cause a crash.

We said that the L2 protocols could have padded out the packet to reach a specific minimum length. The function pskb_trim_rcsum checks whether that happened and, if it did, trims the packet to the right size with __pskb_trim and invalidates the L4 checksum in case it had been computed by the receiving NIC. __pskb_trim is slightly complex because it may need to deal with fragmented buffers, too.

When the L4 checksum is computed in hardware by the network card, it could include the L2 padding if the card is not smart enough to leave it out. Since here there is no way to know whether that was the case, to be on the safe side, pskb_trim_rcsum simply invalidates the checksum and forces the L4 protocol to recompute it. See the section "Checksums" in Chapter 18 for more details.

        if (pskb_trim_rcsum(skb, len)) {
            IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS);
            goto drop;
        }
    }
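The effect of the trim can be sketched with a toy buffer; `buf` and `trim_l2_padding` are illustrative names only:

```c
/* Toy stand-in for the two sk_buff properties involved. */
struct buf {
    unsigned int len;      /* bytes actually in the buffer */
    int          csum_ok;  /* nonzero if a hardware L4 checksum is cached */
};

/* Sketch of the pskb_trim_rcsum idea: when L2 padding made the buffer
 * longer than the IP header's tot_len, trim it back and invalidate
 * the cached checksum, which may have covered the padding bytes. */
static void trim_l2_padding(struct buf *b, unsigned int ip_tot_len)
{
    if (b->len > ip_tot_len) {
        b->len = ip_tot_len;
        b->csum_ok = 0;    /* force the L4 protocol to recompute it */
    }
}
```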

Finally we get to the end of the function. Note that no routing decision or option handling has been done so far; that's the job of ip_rcv_finish. As we anticipated earlier in the chapter, the function ends with a call to the Netfilter subsystem, which more or less can be read in this way:

"skb is the packet that was received from device dev; please check whether the packet is allowed to proceed with its travel, or if it needs changes. Take into consideration that we are asking you this from the NF_IP_PRE_ROUTING point within the network stack (which means the packet was received but no routing decision was taken yet). If you decide not to drop the packet, execute ip_rcv_finish."

    return NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL,
                   ip_rcv_finish);
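The control flow of NF_HOOK can be pictured as a hook plus a continuation. The sketch below uses invented names and a single hook function, whereas the real macro walks the list of hooks registered at that point:

```c
#include <stddef.h>

enum verdict { VERDICT_ACCEPT, VERDICT_DROP };

typedef enum verdict (*hook_fn)(void *pkt);
typedef int (*ok_fn)(void *pkt);

/* If the hook (think: a Netfilter rule) does not drop the packet,
 * the continuation (here playing the role of ip_rcv_finish) runs. */
static int run_hook(hook_fn hook, void *pkt, ok_fn cont)
{
    if (hook && hook(pkt) == VERDICT_DROP)
        return -1;                     /* dropped at the hook point */
    return cont(pkt);
}

static enum verdict drop_all(void *pkt) { (void)pkt; return VERDICT_DROP; }
static int finish(void *pkt)           { (void)pkt; return 0; }
```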

See the earlier section "Interaction with Netfilter" for background information.

The ip_rcv_finish Function

ip_rcv did not do much more than a basic sanity check of the packet. So when ip_rcv_finish is called, it will take care of the main processing, such as:

  • Deciding whether the packet has to be locally delivered or forwarded. In the second case, it needs to find both the egress device and the next hop.

  • Parsing and processing some of the IP options. Not all of the options are processed here, however, as we will see when analyzing the forwarding case.

This is the prototype of the ip_rcv_finish function, defined in the same net/ipv4/ip_input.c file as ip_rcv.

static inline int ip_rcv_finish(struct sk_buff *skb)

The skb->nh field was initialized in netif_receive_skb, which came earlier in the receiving path. At that time, the L3 protocol was not yet known, so it was initialized using nh.raw. Now the function can get a pointer to the IP header.

struct net_device *dev = skb->dev;
    struct iphdr *iph = skb->nh.iph;

skb->dst may contain information about the route to be taken by the packet to get to its destination. If that information is not known yet, the function asks the routing subsystem where to send the packet, and if the latter says the destination is unreachable, the packet is dropped. See the section "Local Delivery" in Chapter 20 for an example of when skb->dst is not NULL here.

    if (skb->dst == NULL) {
        if (ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev))
            goto drop;
    }

Then the function updates some statistics that are used by Traffic Control (the Quality of Service, or QoS, layer).

#ifdef CONFIG_NET_CLS_ROUTE
    if (skb->dst->tclassid) {
        struct ip_rt_acct *st = ip_rt_acct + 256*smp_processor_id( );
        u32 idx = skb->dst->tclassid;
        st[idx&0xFF].o_packets++;
        st[idx&0xFF].o_bytes+=skb->len;
        st[(idx>>16)&0xFF].i_packets++;
        st[(idx>>16)&0xFF].i_bytes+=skb->len;
    }
#endif

When the length of the IP header is bigger than 20 bytes[*] (5 × 32 bits) it means there are options to process. skb_cow, whose name comes from the well-known phrase "Copy on Write," is called here to make a copy of the buffer if the latter is shared with someone else. Exclusive ownership of the buffer is needed because we are about to process the options and will probably need to change the IP header.

    if (iph->ihl > 5) {
        struct ip_options *opt;
        if (skb_cow(skb, skb_headroom(skb))) {
                   IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS);
            goto drop;
        }
        iph = skb->nh.iph;

ip_option_compile is used to interpret the IP options carried in the header. The next section describes its implementation in detail. Right now we are interested in the output of that function. We saw in Chapter 2 that skb contains a field called cb that can be used to store private data by whomever manages an sk_buff buffer. In this case, the IP layer uses it to store the result of the IP header option parsing plus some other stuff such as fragmentation-related information. The result is stored in a data structure of type struct inet_skb_parm, defined in include/net/ip.h and accessed with the macro IPCB (see the section "ipq Structure" in Chapter 23).

If there are any wrong options, the packet is discarded and a special Internet Control Message Protocol (ICMP) message is sent back to the sender to notify the latter about the problem. As we will see in Chapter 25, ICMP messages contain information about where the error was found in the header, something that could help the sender to understand what happened.

You will see in the next section that when the first input parameter to ip_options_compile is NULL, the output of the parsing process is stored in IPCB(skb)->opt; this explains why the parsed options are retrieved with IPCB.

            if (ip_options_compile(NULL, skb))
                    goto inhdr_error;

Note that ip_options_compile simply checks whether the options are correct and stores them in an ip_option structure inside the private data field pointed to by skb->cb. The function does not handle any of them. Instead, the upcoming piece of code partially takes care of that.

In case the packet was source routed, the kernel needs to check whether the configuration of the device allows that option to be used. (If you are not familiar with IP source routing, check the section "Option: Strict and Loose Source Routing.")

I briefly describe the in_device structure and the associated APIs in the section "in_device Structure" in Chapter 23. If there was no explicit configuration for IP source routing, the option would be allowed by default. If, on the other hand, that option was disabled, the packet is dropped (but no ICMP message is generated). NIPQUAD is a simple macro defined in include/linux/kernel.h that splits a 32-bit variable into four 8-bit components.

            if (opt->srr) {
                struct in_device *in_dev = in_dev_get(dev);
                if (in_dev) {
                    if (!IN_DEV_SOURCE_ROUTE(in_dev)) {
                        if (IN_DEV_LOG_MARTIANS(in_dev) && net_ratelimit( ))
                            printk(KERN_INFO "source route option %u.%u.%u.%u -> %u.
  %u.%u.%u\n",
                                   NIPQUAD(iph->saddr), NIPQUAD(iph->daddr));
                            in_dev_put(in_dev);
                            goto drop;
                        }
                        in_dev_put(in_dev);
            }
            if (ip_options_rcv_srr(skb))
                goto drop;
        }
    }
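The splitting performed by NIPQUAD in the printk above can be sketched as a small helper (the function name is invented; the kernel macro expands to four comma-separated expressions for use in a format string):

```c
#include <stdint.h>

/* Break a 32-bit IPv4 address into its four dotted-quad octets.
 * For simplicity the address is taken in host byte order here;
 * NIPQUAD itself operates on the network-byte-order value. */
static void ip_to_quad(uint32_t addr, uint8_t out[4])
{
    out[0] = (uint8_t)(addr >> 24);
    out[1] = (uint8_t)(addr >> 16);
    out[2] = (uint8_t)(addr >> 8);
    out[3] = (uint8_t)addr;
}
```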

When IP source routing is allowed on the device, the code calls ip_options_rcv_srr to set skb->dst and decide how to handle the packet, which means deciding which device to use to forward the packet toward the next hop in the source route list. Normally, the requested next hop refers to another host, and the function simply sets opt->srr_is_hit to indicate the address has been found. The ip_options_rcv_srr function has to take into account, however, the possibility that the "next hop" may be an interface on the local host. If that happens, the function writes the IP address into the destination IP address of the IP header and goes on to check the next address in the source routing list, if there is one (in the code, this is called a superfast loopback forward). ip_options_rcv_srr keeps browsing the list of next hops in the IP header source routing option block until it finds an IP address that is not local to the host. Normally, there will be no more than one local IP address in that list. However, it is legal to have more than one. In the latter case, going from one next hop to the following one is a no-op—i.e., one more loop inside ip_options_rcv_srr. The srr_is_hit flag is set when the last next-hop found by ip_options_rcv_srr is not a local IP address, which means the packet has not reached its final destination and needs to be forwarded.

If the packet is to be forwarded, as we will see in the section "ip_forward_finish Function" in Chapter 20, the initialization of srr_is_hit tells ip_forward_options to take care of the source routing option by adding the necessary data to the IP header. If the packet is being transmitted (that is, if it originated on this host), opt->faddr will be used instead and the opt->srr_is_hit flag will not be used.

The term MARTIANS is used in the previous code to decide whether a parameter value is wrong. The term is not a fanciful choice by the Linux developers but comes from the RFCs themselves.

ip_rcv_finish ends with a call to dst_input, which actually invokes the function stored in the dst field of the skb buffer. skb->dst was initialized either near the beginning of ip_rcv_finish, or near the end within ip_options_rcv_srr (which is called if the IP source routing option is present in the header). skb->dst->input is set to ip_local_deliver or ip_forward, depending on the destination address of the packet. The call to dst_input therefore completes the processing of the packet (see Figure 18-1 in Chapter 18 and the earlier section "Interaction with the Routing Subsystem").

See also the section "Source Routing" in Chapter 35 for the relationship between the call to ip_route_input in ip_rcv_finish and the one in ip_options_rcv_srr.

IP Options

Because of the overhead associated with the time needed to process IP options, they have never been used much. In the next sections, we will see one by one the IP options handled by the Linux kernel and how they are processed.

Here are the main APIs involved with IP option management, all of them defined in net/ipv4/ip_options.c. To understand some of them, remember that not all of the IP options of a packet need to be replicated in all of its fragments.

ip_options_compile

Parses a block of options from an IP header and initializes an instance of an ip_options structure accordingly. This structure will be used later to process the options; it includes flags and pointers that tell the part of the routing subsystem that handles forwarding what has to be written into the IP header options space, and where. ip_options_compile is described in detail in the section "Option Parsing."

ip_options_build

Initializes the portion of an IP header dedicated to the options, based on an input ip_options structure. This function is used when transmitting locally generated packets. Thanks to an input parameter, it can distinguish fragments and treat them accordingly: it omits from the header of each fragment those options that do not have to be copied into that fragment (see the section "IP options" in Chapter 18), and overwrites them with null options instead. It also clears the flags of the ip_options structure (such as opt->rr_needaddr) that are used to signal the need to add a timestamp or an address to the options.

ip_options_fragment

Because the first fragment is the only one that inherits all the options of the original packet, the size of its header is supposed to be greater than or equal to the size of the following ones. Linux simplified this rule, however. By keeping the same header size for all fragments, Linux makes the fragmentation process simpler and more efficient. This is achieved by copying the original header with all its options and overwriting the options that do not need to be replicated (those where IPOPT_COPY is not set) with null options (IPOPT_NOOP) and clearing all the flags of the ip_options structure associated with them (e.g., ts_needaddr), on all fragments but the first one. Null options are described later in the section "Option Parsing."

This last operation is exactly the purpose of ip_options_fragment. When we talk about ip_fragment in Chapter 22, we will see that after the first IP fragment has been sent, the kernel calls ip_options_fragment to change the IP header, and recycles the new adapted header thereafter for all of the following fragments.
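The overwrite step can be sketched as a walk over the TLV-encoded option block. The constants follow the standard IP option encoding; the function name and the error handling are simplified for illustration:

```c
#include <stdint.h>

#define IPOPT_END    0      /* single-byte end-of-options */
#define IPOPT_NOOP   1      /* single-byte no-operation */
#define IPOPT_COPIED 0x80   /* "copy into fragments" flag in the type byte */

/* Sketch of what ip_options_fragment does on non-first fragments:
 * every option whose copied bit is clear is overwritten with NOOPs,
 * so the header keeps its size but loses the non-replicated options. */
static void noop_uncopied_options(uint8_t *opt, unsigned int len)
{
    unsigned int i = 0;
    while (i < len && opt[i] != IPOPT_END) {
        unsigned int optlen;
        if (opt[i] == IPOPT_NOOP) {
            i++;
            continue;
        }
        optlen = opt[i + 1];
        if (optlen < 2 || i + optlen > len)
            break;                     /* malformed; real code flags an error */
        if (!(opt[i] & IPOPT_COPIED)) {
            unsigned int j;
            for (j = 0; j < optlen; j++)
                opt[i + j] = IPOPT_NOOP;
        }
        i += optlen;
    }
}
```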

ip_forward_options

When forwarding a packet, some options may need to be processed. ip_options_compile parses the options and initializes a set of flags in the ip_options structure used to store the result of the parsing. Later, ip_forward will handle them.

ip_options_get

This function receives a block of options, parses them with ip_options_compile, and stores the result in an ip_options structure it allocates. It can receive the input options from either kernel space or user space; there is an input parameter to specify the source. An example of usage is via the ip_setsockopt function that is used by L4 protocols such as TCP and UDP to set the IP options on a given socket (see the system call setsockopt). ip_options_get takes care of the padding described in the section "'End of option list' and 'No operation' options" in Chapter 18.

ip_options_echo

Given an ingress IP packet and its IP options, this function builds the IP options to use to reply back to the sender. For example, the source route options must be reversed on the reply packet. Refer to RFC 1122 (Requirements for Internet Hosts), sections 3.2.1.8, 4.1.3.2, and 4.2.3.8, and to RFC 1812 (Requirements for IP Version 4 Routers).

Some of the places where this routine is invoked include:

  • icmp_reply to reply to an ingress ICMP request

  • icmp_send when an ingress IP packet meets conditions that require the generation of an ICMP message

  • ip_send_reply, which is the generic routine provided by IP to reply to an ingress IP packet

  • TCP to save the options of an ingress SYN segment
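One concrete piece of that work, reversing the recorded route for the reply, can be sketched as follows. The helper is illustrative only; ip_options_echo operates on the raw option bytes rather than an address array:

```c
#include <stdint.h>

/* To send a reply back along the path it came from, the list of hop
 * addresses recorded in a source route option is used in reverse. */
static void reverse_route(uint32_t *hops, unsigned int n)
{
    unsigned int i;
    for (i = 0; i < n / 2; i++) {
        uint32_t tmp = hops[i];
        hops[i] = hops[n - 1 - i];
        hops[n - 1 - i] = tmp;
    }
}
```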

Now let's see how the functions are used in practice. Because you have not yet seen the internals of all the functions in Figure 18-1 in Chapter 18, you may not understand everything at this stage. You can come back to this second part of the section once you are familiar with the other functions.

As you saw in Figure 18-1 in Chapter 18, different paths can lead to the transmission of a packet, and they handle the IP options in slightly different ways. I will cover two cases and leave you the others as an exercise.

Option Processing

The options of an ingress IP packet are first parsed with the ip_options_compile function, described in the next section. As mentioned in the previous section, the options are then processed by different routines at different times, depending on whether a packet is to be forwarded, fragmented, etc. Figure 19-3 summarizes where the key routines introduced in the previous section (with a lighter color) are called for ingress packets and for locally generated packets.

When an ingress packet is to be forwarded, ip_rcv_finish calls ip_forward (via dst_input) to take care of the forwarding process. ip_forward handles the Router Alert option, if present, and makes sure that there are no problems with the strict source route option. Then it asks ip_forward_finish to complete the job of forwarding. The latter can behave differently depending on whether the header contains options.

Let's suppose the packet had options. In this case, ip_forward_finish calls ip_forward_options to handle those options that should be processed when forwarding a packet, and then calls dst_output to carry out the actual transmission. As shown in Figure 18-1 in Chapter 18, dst_output ends up calling ip_output when the ingress IP packet needs to be forwarded.

Figure 19-3. (a) Ingress packets; (b) locally generated packets

At this stage, the IP header is ready to be used, because all of the options have been processed. If there was no fragmentation, options processing is finished. However, if the packet needs to be fragmented, ip_output needs to make sure that only the first fragment includes all of the options; the others should have only a subset, according to Table 18-1 in Chapter 18. In this case, ip_output calls ip_fragment. Once the first fragment is done, ip_fragment uses ip_options_fragment to clear the options that are not needed for the subsequent fragments. This way, ip_fragment can keep copying the IP header from the original packet and have all the options correct.

In a locally generated packet, options are handled with ip_options_build. We will see in Chapter 21 how that function is used by ip_queue_xmit and ip_push_pending_frames.

Option Parsing

Parsing, here, means extracting the IP options from the format in which they are stored in an IP packet's header and storing them in a structure called ip_options that is more convenient for program code to handle. Storing them in a dedicated data structure is useful because different options are handled in different parts of the IP code. ip_options_compile only parses the options; it does not process them. We saw in the previous section where options are processed.

The function ip_options_compile is called in two different cases:

  • By ip_rcv_finish to parse and validate the IP options of the input packets. As shown in Figure 18-1 in Chapter 18, ip_rcv_finish is called for all ingress packets, regardless of whether they will be delivered locally or forwarded. When I refer to ingress packets in this section, I am including the case of ingress packets that need to be forwarded because they are not addressed to the local system.

  • By ip_options_get, for example, to parse the input to the setsockopt system call for AF_INET sockets.

Let's now analyze how ip_options_compile parses the options of an IP packet's header. This is the function's prototype:

int ip_options_compile(struct ip_options * opt, struct sk_buff * skb)

The values of the two input parameters let the function know the context in which it is being called:

  • Ingress packet: skb not NULL (in this case, opt is NULL)

  • Packet being transmitted: skb equal to NULL (in this case, opt is non-NULL)

This means that depending on the function's context, the IP header is stored in different places. When transmitting a locally generated packet, opt is not NULL and opt->data contains a pointer to an IP header that was previously partially generated by the caller. If instead the function is processing an ingress packet, the header is contained in the skb input buffer and opt is NULL. In this second case, the ip_options structure is stored in skb->cb. ip_options_compile initializes local variables such as optptr according to where the IP header is located (i.e., skb->nh or opt->__data). The value of skb is also often used by ip_options_compile to distinguish between the two previous cases.

In both cases (transmit and forward), you need to fill in opt. The only choices to make are where to get the input IP header to parse and where to store the result.

    if (!opt) {
        opt = &(IPCB(skb)->opt);
        memset(opt, 0, sizeof(struct ip_options));
        iph = skb->nh.raw;
        opt->optlen = ((struct iphdr *)iph)->ihl*4 - sizeof(struct iphdr);
        optptr = iph + sizeof(struct iphdr);
        opt->is_data = 0;
    } else {
        optptr = opt->is_data ? opt->__data : (unsigned char*)&(skb->nh.iph[1]);
        iph = optptr - sizeof(struct iphdr);
    }

If parsing fails, ip_options_compile returns immediately. The caller will handle the event in one of the following ways, depending on whether the options were used by a received or transmitted packet:

Bad option in a received packet

An ICMP message is sent back to the source.

Bad option in a transmitted packet

The application is notified through an error value returned by the function used to transmit the packet.

Among the possible reasons for a parsing failure are:

  • A single option cannot be present more than once in the header. The only exception is the dummy or null option IPOPT_NOOP. The latter can be present any number of times and is usually used to enforce some kind of alignment, either on an individual option or on the payload that follows the options (the null option needs no handling).

  • A header field has been assigned an invalid value, or a value that the current user is not allowed to use. This case applies to locally generated traffic. Only the superuser is allowed to generate IP packets with option or suboption codes not understood by the kernel. The check for the superuser privilege is done by the capable function.

    The original IP RFC says that when receiving an option that is not understood, a router should just ignore it. Linux behaves differently only with locally generated packets (see the earlier reference to capable).

Currently, there are only two single-byte options:

  • End of options (IPOPT_END)

  • Null option (IPOPT_NOOP)

The main for loop simply goes option by option and stores the result of parsing in the output ip_options structure opt. The code inside the loop may look complex, but actually it is very easy to read if you take into consideration the following points:

  • l represents the size of the block of options that has not been parsed yet.[*]

  • optptr points to the current position on the block of options being analyzed. optptr[1] is the option's length, and optptr[2] is the option pointer (where the option starts). Figure 19-4 shows where the array's elements point. The code that handles each option always starts with two sanity checks based on these parameters.

  • optlen gets initialized to the length of the current option. Do not confuse optlen with opt->optlen. Note that when opt is not NULL, optlen is not initialized because that has already been done in ip_options_get.

  • The flag is_changed is used to keep track of when the header has been changed (which requires the checksum to be recomputed).

Figure 19-4. ip_options_compile's local variables' values in the middle of an execution

There cannot be other options after the IPOPT_END option. Therefore, as soon as one is found, whatever follows it is overwritten with more IPOPT_END options.

The basic sanity checks for multibyte options include:

  • The option must be at least four bytes long. Since the header of the option is three bytes long, the field pointer cannot be smaller than 4. The timestamp option, for instance, requires at least a length of five octets, where four are used just by the header (see Figure 18-8 in Chapter 18).

  • Options that reserve space in the header, because they are supposed to be filled in by the next hops or by the destination host, must respect the size required by the option. For instance, the timestamp option is supposed to reserve a space that is a multiple of four bytes (the size of an IPv4 address).

Since the length of each option includes the first two bytes (type and length) and since it starts counting from 1 (not 0), if length is less than 2 or bigger than the block of options left to analyze, there is an error:

        if (optlen<2 || optlen>l) {
            pp_ptr = optptr;
            goto error;
        }

Note that some options (such as TIMESTAMP) have a minimum length bigger than 2, and thus the general check just shown is necessary but not always sufficient. The more specific checks are inside the per-option handlers. When an error is found in the options, a special ICMP message has to be sent back to the sender. This ICMP packet includes the original IP header, eight bytes of the IP payload, and an offset that points to where the error was found. The eight bytes of the IP payload consist of the start of the L4 header and usually include the L4 port numbers; this allows the receiver of the ICMP error message to find the socket associated with the faulty IP packet (more details in Chapter 25). Before returning the error message, the code initializes pp_ptr to point to the place where the problem was found.

The switch statement uses, as its discriminator, the option type field. Therefore, each option is handled by a different statement, exactly as was done before for the single-byte options:

        switch (*optptr)

The next sections analyze the multibyte options one by one, and Figures 19-5(a) and 19-5(b) show the big picture. The two obsolete options SEC and SIC are recognized but not processed[*] (see RFC 1812).

Option: Strict and Loose Source Routing

Only one Source Routing option can appear in a header. The flag opt->srr is used to detect that condition: if the following code does not find any error in the option, it sets that flag. If another option of the same type appears later in the header, the error will be detected.

opt->is_strictroute is used to tell the caller whether the Source Routing option was loose or strict.

The section "ip_forward Function" in Chapter 20 shows how packets are dropped if they cannot reach their destinations while respecting the Source Routing rules.

The option is considered faulty if the length of the option (including type and length) is less than 3. This is because the value has to contain the type, length, and pointer fields. At the same time, pointer cannot have a value smaller than 4 because the first three bytes of the option are already used by the type, length, and pointer fields.

When the input skb parameter is NULL, it means that ip_options_compile has been called to parse the options of an outgoing packet (generated locally, not forwarded). In that case, the first IP address in the array of addresses provided by user space is saved in opt->faddr and then removed from the array by shifting the other elements of the array back one position with a memmove operation. This address will be retrieved later by the functions described in Chapter 21 (ip_queue_xmit and the users of ip_append_data), so they know the destination IP address. An easy-to-follow example of the use of opt->faddr can be found in the function udp_sendmsg.

            if (!skb) {
                if (optptr[2] != 4 || optlen < 7 || ((optlen-3) & 3)) {
                    pp_ptr = optptr + 1;
                    goto error;
                }
                memcpy(&opt->faddr, &optptr[3], 4);
                if (optlen > 7)
                    memmove(&optptr[3], &optptr[7], optlen-7);
            }
            opt->is_strictroute = (optptr[0] == IPOPT_SSRR);
            opt->srr = optptr - iph;
            break;
Figure 19-5a. ip_options_compile overview

Figure 19-5b. ip_options_compile overview

Option: Record Route

For the Record Route option, as for Timestamp, the sender reserves the part of the header it will use in advance. Because of this, when processing the option, new elements are added to the header only if there is some room left. If there is space, the ip_options_compile function sets the flag rr_needaddr to tell the routing subsystem to write the IP address of the outgoing interface into the IP header once the routing decision is taken.[*] Note that the list of IP addresses includes the transmitting interface's address if the options belong to a locally generated packet.

            if (optptr[2] <= optlen) {
                if (optptr[2]+3 > optlen) {
                    pp_ptr = optptr + 2;
                    goto error;
                }

                if (skb) {
                    memcpy(&optptr[optptr[2]-1], &rt->rt_spec_dst, 4);
                    opt->is_changed = 1;
                }
                optptr[2] += 4;
                opt->rr_needaddr = 1;
            }
            opt->rr = optptr - iph;
            break;

Since skb is non-null only when you are processing the options of an ingress packet, this piece of code simply copies the preferred source IP address into the list of addresses being recorded in the header, and updates the flag is_changed, which will force the IP checksum to be updated. See the section "Preferred Source Address Selection" in Chapter 35 for the reason why the rt_spec_dst IP address is used.

Whether the address is written in the block of code shown here (because the packet is being forwarded) or will be written later thanks to the rr_needaddr flag, the pointer field of the option is moved ahead four bytes (the size of an IP address). This explains why ip_forward_options (which will be executed if the packet we are processing is being forwarded) will have to go back four bytes to write the IP address into the right position.

Option: Timestamp

Because optlen represents the length of the option being analyzed, the if statement simply checks whether any space is left to store the new information. In this case, the length of the option represents the space reserved by the transmitter (not the space used so far).

        if (optptr[2] <= optlen) {
                __u32 * timeptr = NULL;

The handling of the option depends on the suboption specified by the sub-type field in Figure 18-8 in Chapter 18, but the three suboptions are handled in the same general way. Regardless of the subtype, whoever is going to handle the option needs two pieces of information (which will be stored in the ip_option structure):

  • Whether it must record an address, a timestamp, or both

  • Where in the IP header the information has to be written (the offset)

If a timestamp needs to be recorded (this would be true for the TS_ONLY and TS_TSANDADDR cases), timeptr would be initialized to point to the right place where it should be written inside the IP header. Note also that timeptr is initialized only when skb is not NULL, which is the case when the option belongs to an ingress packet (as opposed to one that is locally generated).

We already saw in the section "Option Parsing" that ip_options_compile can also be called when handling locally generated packets. In that case, skb would be NULL, so timeptr would not be initialized (i.e., it would be left NULL) and no timestamp would be recorded in the header. There is nothing wrong here, because the timestamp will be put there by ip_options_build. That function will store the timestamp because opt->ts_needtime equals 1.

The only difference between processing an ingress packet to be forwarded and a locally generated packet is that in the former case, a timestamp is added to the IP header and the checksum has to be recomputed (so opt->is_changed needs to be set as well).

When the subcode is IPOPT_TS_PRESPEC, the timestamp has to be added only when the next IP address to match is local to the system. The function used to make that check is inet_addr_type; here are the main return values:

RTN_LOCAL

The IP address belongs to a local interface.

RTN_UNICAST

The IP address is reachable according to the routing table and is unicast.

RTN_MULTICAST

The address is multicast.

RTN_BROADCAST

The address is broadcast.

Since local broadcasts and registered multicast addresses could be considered local (i.e., addresses the system listens to), the following piece of code that checks RTN_UNICAST does exactly what we want—it determines whether the address is local:

        {
            u32 addr;
            memcpy(&addr, &optptr[optptr[2]-1], 4);
            if (inet_addr_type(addr) == RTN_UNICAST)
                break;
            if (skb)
                timeptr = (_ _u32*)&optptr[optptr[2]+3];
        }
        opt->ts_needtime = 1;

Depending on the suboption being processed, the timestamp has to be written at a different offset within the IP header. The first part initializes timeptr accordingly, and the second part copies the timestamp to the right position. Depending on the suboption, the ts_needtime and ts_needaddr flags are also initialized.

        if (timeptr) {
            struct timeval tv;
            __u32 midtime;
            do_gettimeofday(&tv);
            midtime = htonl((tv.tv_sec % 86400) * 1000 + tv.tv_usec / 1000);
            memcpy(timeptr, &midtime, sizeof(__u32));
            opt->is_changed = 1;
        }

This last part takes care of the counter overflow we described in the section "Timestamp Option" in Chapter 18.

        unsigned overflow = optptr[3]>>4;
        if (overflow == 15) {
            pp_ptr = optptr + 3;
            goto error;
        }
        opt->ts = optptr - iph;
        if (skb) {
            optptr[3] = (optptr[3]&0xF)|((overflow+1)<<4);
            opt->is_changed = 1;
        }

Option: Router Alert

As we explained in the section "Router Alert Option" in Chapter 18, the last two bytes of this option must be zero. If this option passes the sanity check, ip_options_compile initializes the router_alert flag so that later ip_forward will handle it accordingly. (opt->router_alert is simply treated as Boolean, zero, or nonzero.)

            if (optptr[2] == 0 && optptr[3] == 0)
                opt->router_alert = optptr - iph;

Handling parsing errors

If the error was found in a locally generated packet (skb==NULL), the function simply returns an error that will have to be handled by the caller. If instead it was found on a received IP packet, an ICMP error message has to be sent back to the source:

error:
    if (skb) {
        icmp_send(skb, ICMP_PARAMETERPROB, 0, htonl((pp_ptr-iph)<<24));
    }
    return -EINVAL;
}



[*] Do not confuse data fragments with IP fragments. See Chapter 2 for the use of the skb_shinfo macro.

[*] From the L2 perspective, the payload is the IP header and everything that follows it.

[*] The IP protocol specification (RFC 791) says that an Internet host must be able to forward a datagram of 68 bytes without having to fragment it: in other words, the L2 protocol must be able to transmit a frame with a payload of at least 68 bytes.

[] See Chapter 21 for examples of what a fragmented buffer looks like.

[*] 20 bytes is the length of an IP header without options.

[*] While reading the code, make sure you do not confuse the variable l, used as the index of the for loop, with the integer 1. They look quite the same and it is easy to lose an hour trying to understand the code if you confuse them. It has already happened to one person.

[*] There are some other IP options, such as the IP MTU Discovery Option (RFC 1063), that were defined but never really used or found useful in past years, and that were therefore made obsolete. IP MTU Discovery in particular has been replaced by path MTU discovery (RFC 1191, covered in the section "Path MTU discovery" in Chapter 18).

[*] This is done by calling ip_options_build. See Chapter 21.

Chapter 20. Internet Protocol Version 4 (IPv4): Forwarding and Local Delivery

At the end of the ip_rcv_finish function, if the destination address is different from the local interface, the kernel has to forward packets to the appropriate host. On the other hand, if the destination address is local, the kernel has to prepare the packet for use by higher layers. As discussed in the section "The ip_rcv_finish Function" in Chapter 19, the correct choice is taken from the skb buffer through a call to dst_input. Let's see now how the two tasks (forwarding and local delivery) are accomplished.

Forwarding

As with many networking activities described in the previous chapter, forwarding is split into two functions: ip_forward and ip_forward_finish. The second is called at the end of the first, if Netfilter allows it. Both functions are defined in net/ipv4/ip_forward.c.

By this time, thanks to the call to ip_route_input in ip_rcv_finish described in Chapter 19, the sk_buff buffer contains all the information needed to forward the packet. Forwarding consists of the following steps:

  1. Process the IP options. This may involve recording the local IP address and a timestamp if options in the IP header require them.

  2. Make sure that the packet can be forwarded, based on the IP header fields.

  3. Decrement the Time To Live (TTL) field of the IP header and discard the packet if the TTL becomes 0.

  4. Handle fragmentation if needed, based on the MTU associated with the route.

  5. Send the packet out to the outgoing device.

If the packet cannot be forwarded for some reason, the source host has to be notified with an ICMP message that describes the problem encountered. An ICMP message could also be sent as a warning even if the packet will be forwarded, as when a packet is routed with a suboptimal route and triggers a redirect. In the following sections, we'll examine these and other activities in the ip_forward function.

Interaction with IPsec is a major part of forwarding, and is implemented by xfrm4_xxx routines in ip_forward, which are hooks into the IPsec infrastructure. They are not covered in this book for lack of space. The behavior documented here is how forwarding works when IPsec is not configured, in which case those calls simply become no-ops.

ICMP Redirect

An ICMP redirect message is sent by a host system (usually a router) when it has been asked to do something that another router is better suited to do (see Chapters 25 and 31 for more details).

When a packet has been source routed, the router assumes the sender had a good reason for requesting the route and does not second-guess it. It honors the requested route and does not send an ICMP redirect message. This special case is covered in the section "ip_forward Function."

ip_forward Function

As we have seen, ip_forward is invoked by ip_rcv_finish (see Figure 18-1 in Chapter 18) to handle all input packets that are not addressed to the local system. The function receives as an input parameter the buffer skb associated with the packet; all the necessary information is inside that structure. skb->dst, the routing information, was initialized by the call to ip_route_input in ip_rcv_finish earlier in the code path (see Chapter 33 for more details).

Figure 20-1 summarizes the internals of the function:

int ip_forward(struct sk_buff *skb)
Figure 20-1. ip_forward function

The function revolves around manipulations of skb and of a local variable iph, which represents the packet's IP header and is initialized repeatedly from the iph field of skb. (It has to be reinitialized because the header can be changed by some of the functions called from ip_forward.)

If the Router Alert option was found in the IP header, it is handled now.[*] The function handler for this option is ip_call_ra_chain, which relies on a global list (ip_ra_chain) that contains the list of local sockets that set the IP_ROUTER_ALERT option because they are interested in IP packets that carry the Router Alert IP option. When an ingress IP packet is fragmented, ip_call_ra_chain first defragments the entire IP packet and only then delivers it to the Raw sockets of the ip_ra_chain list, as shown in Figure 18-1 in Chapter 18.

The functions that manage the alert can be found in net/ipv4/ip_sockglue.c (see, for example, ip_ra_control and how it is called by ip_setsockopt to apply an option to a socket as requested by the user with a call to the setsockopt system call). ip_forward has no further work to do, and returns success.

If there is no Router_Alert option in the header, or if it is present but no interested processes are running (in which case ip_call_ra_chain returns FALSE), ip_forward continues:

    if (IPCB(skb)->opt.router_alert && ip_call_ra_chain(skb))
        return NET_RX_SUCCESS;

The following check is used just to make sure that the packet we're handling was actually addressed to our host at L2. skb->pkt_type is initialized at the L2 layer (see Chapter 13), and defines the type of frame. It is assigned the value PACKET_HOST when the frame is addressed to the L2 address of the receiving interface. If the lower-level functions do their jobs correctly, there should be no need for this check, but we do it just in case an error left us with a packet we should not have received in the first place.

    if (skb->pkt_type != PACKET_HOST)
        goto drop;

Since we are forwarding the packet, we are operating entirely at the L3 layer and it is not our business to worry about the L4 checksum; we use CHECKSUM_NONE to indicate that the current checksum is OK. If some handling changes the IP header or the TCP header or payload later, before transmission, the kernel will recalculate the checksum there.

    skb->ip_summed = CHECKSUM_NONE;

The real forwarding process starts by decrementing the TTL field. The IP protocol definition says that when TTL reaches the value of 0 (which means you received it with value 1 and it became 0 after you decremented it), the packet has to be dropped and a special ICMP message has to be sent to the source to let it know you dropped the packet.

    if (iph->ttl <= 1)
            goto too_many_hops;

Note that the TTL field has not been decremented yet; that is done a few lines of code later. The reason for waiting is that the packet may still be shared at this point with other subsystems, such as sniffers; the header must be unchanged in that case.

rt points to a data structure of type rtable, which contains all the information needed by the forwarding engine, including the next hop (rt_gateway). If the IP header contains a Strict Source Route option and the next hop (extracted from that option) does not match the one found by the routing subsystem, the Source Routing option fails and the packet is dropped.

    rt = (struct rtable*)skb->dst;
    if (opt->is_strictroute && rt->rt_dst != rt->rt_gateway)
            goto sr_failed;

In this case, another ICMP message is transmitted to the sender.

After most of the sanity checks have been fulfilled, the function updates the packet header a bit and then gives it to ip_forward_finish. Since we are about to modify the content of the buffer, we need to make a local copy for ourselves. The copy is actually done by skb_cow only if the packet is shared (if the packet is not shared it can be safely modified) or if the space available at the head of the packet is not sufficient to store the L2 header.

    if (skb_cow(skb, LL_RESERVED_SPACE(rt->u.dst.dev)+rt->u.dst.header_len))
        goto drop;
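The copy-on-write decision can be modeled in a few lines of userspace C. This is a sketch of the policy only; the struct and field names are invented for illustration and are not the kernel's sk_buff definitions:

```c
/* Minimal model of an sk_buff for the copy decision. The fields are
 * illustrative, not the kernel's. */
struct skb_model {
    int users;     /* reference count: >1 means the buffer is shared */
    int headroom;  /* bytes free at the head of the buffer */
};

/* Returns nonzero when the buffer would have to be copied before it
 * can be modified: either it is shared, or there is not enough
 * headroom to prepend the L2 header. */
static int must_copy(const struct skb_model *skb, int l2_header_len)
{
    return skb->users > 1 || skb->headroom < l2_header_len;
}
```

When neither condition holds, skb_cow leaves the buffer alone and the caller can modify it in place, which is the common case for forwarded packets.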

Now the TTL is decremented by ip_decrease_ttl, which also updates the IP checksum.

    ip_decrease_ttl(iph);
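The incremental checksum update can be sketched in userspace C (this models the kernel helper rather than reproducing it; the struct is illustrative, not struct iphdr). The TTL occupies the high byte of one 16-bit header word, so decrementing it by one raises the ones'-complement checksum by 0x0100, folding the carry back in, in the spirit of RFC 1624:

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Model of the TTL field and the stored (network-order) checksum. */
struct ip_ttl_model {
    uint8_t  ttl;
    uint16_t check;   /* header checksum, network byte order */
};

/* Decrement the TTL and patch the checksum incrementally: lowering
 * the TTL byte by one raises the ones'-complement checksum by 0x0100,
 * with an end-around carry. */
static void decrease_ttl(struct ip_ttl_model *iph)
{
    uint32_t check = iph->check;

    check += htons(0x0100);
    iph->check = (uint16_t)(check + (check >= 0xFFFF));
    iph->ttl--;
}
```

Because both the stored checksum and the 0x0100 delta are kept in network byte order, the same arithmetic works on little-endian and big-endian hosts.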

If a better next hop is available than the requested one, the originating host is now notified with an ICMP REDIRECT message—but only if the originating host did not request source routing. The opt->srr field indicates that source routing was requested, in which case the originating host doesn't care whether a supposedly better route is found. In Chapter 35 you will see when exactly the RTCF_DOREDIRECT flag is set on a cached route to indicate that the source of the packet should be sent an ICMP REDIRECT message.

    if (rt->rt_flags&RTCF_DOREDIRECT && !opt->srr)
        ip_rt_send_redirect(skb);

The priority field is set here using the Type of Service field of the IP header. The priority will be used later by Traffic Control (the QoS layer).

    skb->priority = rt_tos2priority(iph->tos);

The function terminates by asking Netfilter to execute ip_forward_finish, if there are no filtering rules that forbid forwarding.

    return NF_HOOK(PF_INET, NF_IP_FORWARD, skb, skb->dev, rt->u.dst.dev,
               ip_forward_finish);

ip_forward_finish Function

If this function is reached, it means the packet has passed all the checks that could stop it and is truly ready to be sent out to another system.

Two possible options from the IP header have been handled so far, as we saw in the section "ip_forward Function": Router Alert and Strict Source Routing. Now we pass the packet to the function ip_forward_options to handle any final work required by the options. It can find out what needs to be done by checking flags (such as opt->rr_needaddr) and offsets (such as opt->rr) initialized earlier by ip_options_compile, which was invoked from ip_rcv_finish. ip_forward_options also recomputes the IP checksum in case it had to update any of the IP header fields. See the section "Option Processing" in Chapter 19.

The packet is finally transmitted with dst_output, described in the next section:

static inline int ip_forward_finish(struct sk_buff *skb)
{
    struct ip_options * opt = &(IPCB(skb)->opt);

    IP_INC_STATS_BH(IPSTATS_MIB_OUTFORWDDATAGRAMS);

    if (unlikely(opt->optlen))
            ip_forward_options(skb);

    return dst_output(skb);
}

It may seem we are close to the wire, but there are still a couple of tasks to do before having the device driver do the transmission.

dst_output Function

All transmissions, whether generated locally or forwarded from other hosts, pass through dst_output on their way to a destination host, as shown in Figure 18-1 in Chapter 18. The IP header at this point is finished: it embodies the information needed to transmit as well as any other information the local system was responsible for adding.

static inline int dst_output(struct sk_buff *skb)
{
        int err;

        for (;;) {
                err = skb->dst->output(&skb);
                if (likely(err == 0))
                        return err;
                if (unlikely(err != NET_XMIT_BYPASS))
                        return err;
        }
}

dst_output invokes the virtual function output, which has been initialized to ip_output if the destination address is unicast and ip_mc_output if it is multicast. Fragmentation is handled in that function. At the end, ip_finish_output is called to interface with the neighboring subsystem (see Figure 18-1 in Chapter 18). ip_finish_output, described in the section "Interface to the Neighboring Subsystem" in Chapter 21, is invoked only if Netfilter gives the green light (otherwise, the packet is dropped).

Note that the output function can potentially be invoked multiple times if it returns the NET_XMIT_BYPASS value. This is, for instance, a simple mechanism to call a sequence of output routines. The IPsec protocol suite uses it to apply transformations before the real transmission.
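The chaining mechanism can be illustrated with a small userspace model. All names, struct layouts, and the NET_XMIT_BYPASS value are assumptions for illustration; the real hook is the dst->output virtual function. A transforming stage does its work, redirects the output pointer to the next stage, and returns NET_XMIT_BYPASS so that the dst_output loop runs again:

```c
#define NET_XMIT_BYPASS 4   /* illustrative value */

struct pkt;
struct dst_model {
    int (*output)(struct dst_model *dst, struct pkt *skb);
};

struct pkt {
    struct dst_model *dst;
    int transformed;        /* set by the "IPsec-like" stage */
    int transmitted;        /* set by the final stage */
};

/* Final output routine: actually "transmits" the packet. */
static int real_output(struct dst_model *dst, struct pkt *skb)
{
    (void)dst;
    skb->transmitted = 1;
    return 0;
}

static struct dst_model real_dst = { real_output };

/* Transforming stage: applies its transformation, points skb->dst at
 * the next stage, and asks the loop to invoke it. */
static int xfrm_output(struct dst_model *dst, struct pkt *skb)
{
    (void)dst;
    skb->transformed = 1;
    skb->dst = &real_dst;
    return NET_XMIT_BYPASS;
}

static struct dst_model xfrm_dst = { xfrm_output };

/* Simplified dst_output loop, mirroring the kernel function above. */
static int dst_output_model(struct pkt *skb)
{
    for (;;) {
        int err = skb->dst->output(skb->dst, skb);
        if (err != NET_XMIT_BYPASS)
            return err;
    }
}
```

Each stage decides on its own whether to terminate the loop (by returning a final verdict) or hand the packet to the next routine in the sequence.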

Local Delivery

Chapter 35 explains how the forwarding (routing) engine knows that the local host is the packet's destination. We saw at the end of the section "The ip_rcv_finish Function" in Chapter 19 that the call to ip_route_input, at the top of ip_rcv_finish, initializes skb->dst->input to ip_local_deliver when the packet has reached its destination host (as opposed to ip_forward, when it needs to be forwarded). Furthermore, Netfilter is given the final right to decide whether the generic do_something function (such as ip_local_deliver) is allowed to call the corresponding do_something_finish function (in this case, ip_local_deliver_finish) to complete the job.

int ip_local_deliver(struct sk_buff *skb)
{
    if (skb->nh.iph->frag_off & htons(IP_MF|IP_OFFSET)) {
        skb = ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER);
        if (!skb)
            return 0;
    }

    return NF_HOOK(PF_INET, NF_IP_LOCAL_IN, skb, skb->dev, NULL,
               ip_local_deliver_finish);
}

In contrast to forwarding, where defragmentation can mostly be ignored, local delivery has to do a lot of work to handle defragmentation. Except for special cases (such as when Netfilter must defragment a packet to examine its contents), forwarding can be performed on each fragment without trying to recombine them. In contrast, the original IP packet must always be defragmented and passed as a whole for local delivery, because the higher L4 layer is supposed to be blissfully ignorant of the need for fragmentation at the IP layer.

Defragmentation is performed within the ip_defrag function, which returns a pointer to the original packet when it has been completely defragmented, and NULL if it is still incomplete. The code shown from ip_local_deliver calls ip_defrag, checks the return value in the local skb variable, and returns if the packet is incomplete. The second input parameter to ip_defrag is described in the section "ipq Structure" in Chapter 23.

Only when the packet is defragmented can the function deliver it. Netfilter is asked to consult its configuration and execute ip_local_deliver_finish if the packet is accepted. We will cover the details of ip_local_deliver_finish in Chapter 24. Defragmentation was introduced in the section "Packet Fragmentation/Defragmentation" in Chapter 18 and will be shown in detail in Chapter 22.




[*] See the section "Router Alert Option" in Chapter 18.

Chapter 21. Internet Protocol Version 4 (IPv4): Transmission

In this chapter, we discuss packet transmission at the L3 layer, which fits into the top-left corner of Figure 18-1 in Chapter 18. Transmission refers to packets leaving the local host for another; it can be initiated by the L4 layer or be invoked as the final stage of forwarding. As shown in Figure 18-1 in Chapter 18, the central function that delivers a packet is dst_output; the functions described in this chapter precede it and prepare packets for it. The tasks of the kernel at this stage include:

Looking up the next hop

The IP layer needs to know the outgoing device and the next router to use for the next hop. The route is found through the function ip_route_output_flow, called at the L3 or L4 layer. This chapter does not discuss routing, because that subject is big enough for its own discussion and is therefore covered in Part VII.

Initializing the IP header

Several fields, such as the packet ID, are filled in at this stage. If the packet is a forwarded one, a little work was done on the header earlier (such as updating the TTL, checksum, and options fields). But much more must be done at this point to enable transmission.

Processing options

The software has to honor options that require the addition of an address or timestamp to the header.

Fragmentation

If the IP packet is too big to be transmitted on the outgoing device, it must be fragmented (unless fragmentation is explicitly forbidden).

Checksum

The IP checksum has to be computed after all other work on the header is done. We will see that the IP layer may take care of the L4 checksum as well as the L3 checksum . In both cases, the checksum can be computed either in one shot or incrementally. While the checksum is required, the L3 layer doesn't always have to calculate it, because some devices' hardware does it (as denoted by the CHECKSUM_HW flag).
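As an example of the one-shot computation, here is a plain C version of the Internet checksum from RFC 1071. This is a portable sketch, not the kernel's code (the kernel uses per-architecture optimized helpers such as ip_fast_csum); it treats the data as big-endian 16-bit words and returns the result as a host-order integer:

```c
#include <stdint.h>
#include <stddef.h>

/* One-shot Internet checksum (RFC 1071) over an arbitrary buffer. */
static uint16_t ip_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {               /* sum 16-bit big-endian words */
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)                        /* pad a trailing odd byte */
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)               /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The incremental alternative, shown earlier for the TTL decrement, avoids re-reading the whole header when only one field changes.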

Checking with Netfilter

As shown in Figure 18-1 in Chapter 18, the Linux firewall system is given a chance to drop or mangle each packet at various stages of processing, including transmission.

Updating statistics

Depending on the result of the transmission (success or failure) and on actions such as fragmentation, the associated SNMP counters have to be updated.

Option processing and fragmentation are by far the most expensive tasks; fragmentation is addressed in Chapter 22, and options were addressed in Chapter 19. In the past there used to be two different functions for transmission, one for packets that could be transmitted quickly because they didn't need fragmentation or IP option processing, and another that provided all the services. The kernel does not explicitly distinguish the two cases anymore.

Key Functions That Perform Transmission

The two functions listed at the top left of Figure 18-1 in Chapter 18 appear in Figure 21-1, classified by the L4 protocols that invoke them. The reason for two sets of functions is that the right-side L4 protocols (TCP and the Stream Control Transmission Protocol, or SCTP) do a lot of work to prepare for fragmentation; that leaves less work for the IP layer. In contrast, raw IP and the other protocols listed on the left side leave all of the work of fragmentation up to the IP layer.

Figure 21-1 shows the main functions that lie between transmission at L4 and the last step of L3, which is invoking the neighbor function discussed in Chapter 27. At the top of the figure, the most common L4 protocols are shown. UDP and ICMP call one set of L3 functions to carry out transmission, whereas TCP and SCTP call another. When the L3 functions described in this chapter finish their work, they pass packets to dst_output. As for raw IP, when it uses the IP_HDRINCL option it is completely responsible for preparing the IP header, so it bypasses the functions described in this chapter and calls dst_output directly. See the section "Raw Sockets" for more details. The Internet Group Management Protocol (IGMP) also makes a direct call to dst_output (after initializing the IP header on its own).

Thus, fragmentation is handled by the two sets of functions as follows:

ip_queue_xmit

The L4 protocol has already divided the data into chunks that are sized properly for fragmentation (if it is needed), taking into account the PMTU as discussed in Chapter 18. The work at the IP layer consists simply of adding IP headers to the data fragments already created.
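A quick arithmetic illustration of this sizing (the numbers here are assumed, not taken from the book): with a PMTU of 1,500 bytes and no options, 20 bytes go to the IP header and 20 to the TCP header, so TCP hands down at most 1,460 bytes of payload per segment and ip_queue_xmit never has to fragment.

```c
/* Illustrative payload-per-segment computation from the PMTU. Option
 * bytes, when present, would further reduce the result. */
static int tcp_payload_per_segment(int pmtu, int ip_hdr_len, int tcp_hdr_len)
{
    return pmtu - ip_hdr_len - tcp_hdr_len;
}
```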

ip_push_pending_frames and related functions

The L4 protocols invoking this function do not consider fragmentation or help perform it. Furthermore, for the sake of efficiency, they introduce complexity by their way of passing the data in the packet down to the IP layer. Depending on several factors covered later in this chapter, an L4 protocol can store several transmission requests through multiple calls to ip_append_data without actually transmitting anything.

Figure 21-1. Different protocols invoking dst_output differently

ip_append_data does not simply buffer transmission requests, but transparently generates data fragments of optimal sizes to make it easier for the IP layer to handle fragmentation later. This saves the IP layer from having to copy data from one buffer to another while making fragments, and leads to a significant performance gain.

When the L4 protocol needs to flush the output queue created with ip_append_data, the protocol invokes ip_push_pending_frames, which in turn does any necessary fragmentation and pushes the resulting packets down to dst_output.

A variant of ip_append_data named ip_append_page is currently used by UDP. We will briefly describe this function later.
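The buffer-then-flush pattern can be modeled in userspace C. This is a sketch of the idea only; the real functions manipulate sk_buff chains and the cork state in struct inet_sock, and every name below is invented for illustration. Append calls accumulate bytes against a fragment size fixed when the socket was corked, and the flush step determines how many fragments the buffered data becomes:

```c
/* Toy model of the cork/append/push pattern. */
struct cork_model {
    int mtu;       /* fragment payload limit chosen when corked */
    int length;    /* bytes appended so far */
};

static void cork_init(struct cork_model *c, int mtu)
{
    c->mtu = mtu;
    c->length = 0;
}

/* Analogous to ip_append_data: just buffer the request. */
static void append_data(struct cork_model *c, int len)
{
    c->length += len;
}

/* Analogous to ip_push_pending_frames: compute how many packets the
 * buffered data turns into, rounding up. */
static int push_pending_frames(const struct cork_model *c)
{
    if (c->length == 0)
        return 0;
    return (c->length + c->mtu - 1) / c->mtu;
}
```

Because the fragment size is known at append time, the real ip_append_data can place the data directly into fragment-sized buffers, which is exactly the copy-avoidance described above.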

Other functions are also used during transmission in specific contexts:

ip_build_and_send_pkt

Used by TCP to send SYN ACKs.

ip_send_reply

Used by TCP to send ACKs and Resets. The classification of Figure 21-1 only covers the most common cases: since ip_send_reply uses ip_append_data and ip_push_pending_frames, it follows that TCP does not use only ip_queue_xmit.

These will be pretty easy to understand after you understand how the functions in Figure 21-1 work. It is also possible for an L4 protocol to call dst_output directly; IGMP and RawIP are two protocols that do it (see the section "Raw Sockets").

In this chapter, I briefly cover ip_queue_xmit, but spend more time on ip_append_data/ip_push_pending_frames because they are key parts of the complex task of fragmentation.

Multicast Traffic

As shown in Figure 18-1 in Chapter 18, the egress paths followed by transmitted multicast and unicast traffic are similar—more similar than for the ingress path. I do not go into detail about multicast in this book, but in this chapter I will point out some differences between unicast and multicast during transmission. For instance, in the section "Building the IP header" we will see that the TTL is initialized differently for multicast traffic. The same is true when forwarding packets.

Relevant Socket Data Structures for Local Traffic

A BSD socket is represented in Linux with an instance of a socket structure. This structure includes a pointer to a sock data structure, which is where the network layer information is stored. The sock data structure is pretty big but is well documented in include/net/sock.h. The sock data structure is actually allocated as part of a bigger structure that is specific to the protocol family; for PF_INET sockets the structure is inet_sock, defined in include/linux/ip.h. The first field of inet_sock is a sock instance, and the rest stores PF_INET private information, such as the source and destination IP addresses, the IP options, the packet ID, the TTL, and cork (discussed next).

    struct inet_sock {
        struct sock sk;
        ... ... ...
        struct {
            ... ... ...
        } cork;
    }

Given a pointer to a sock data structure, the IP layer uses the inet_sk macro to cast the pointer to the outer inet_sock data structure. In other words, the base address of the inet_sock and sock structures is the same, a feature commonly exploited in C programs that deal with complex, nested structures.
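The cast itself is trivial in C; here is a userspace model of the pattern. The struct layouts and field names are a small, simplified subset for illustration, not the kernel definitions:

```c
/* The container begins with the embedded structure, so a pointer to
 * the inner struct is also a valid pointer to the outer one. */
struct sock_model {
    int sk_protocol;
};

struct inet_sock_model {
    struct sock_model sk;       /* must be the first member */
    unsigned int saddr;         /* PF_INET-private state follows */
    unsigned int daddr;
};

/* Sketch of the inet_sk cast: same base address, wider view. */
static struct inet_sock_model *inet_sk_model(struct sock_model *sk)
{
    return (struct inet_sock_model *)sk;
}
```

Code that is handed only the generic sock pointer can thus reach the PF_INET-private fields without any extra lookup.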

The inet_sock's cork field plays an important role in ip_append_data and ip_append_page: it stores the context information needed by those two functions to do data fragmentation correctly. Among the various information it contains are the options in the IP header (if any) and the fragment length.

Whenever a transmission is generated locally (with only a few exceptions), each sk_buff buffer is associated with its sock instance and is linked to it with skb->sk.

Different functions are used to set and read the value of the fields of the sock and inet_sock structures. Some of them are called by the functions in Figure 21-1. As far as this chapter is concerned, we need to understand the meaning of only a few of them:

sk_dst_set and __sk_dst_set

Once the socket is connected, these functions save the route used to reach the destination in the sock structure. sk_dst_set is a simple wrapper to __sk_dst_set that takes care of locking. If locking is not needed (because it was already taken care of), __sk_dst_set can be called directly.

sk_dst_check and __sk_dst_check

As the names suggest, the validity of the route can be tested with these two APIs. However, if the route is valid, they return it as their return value. This means that these functions can be used to retrieve the route, not just to test the validity. (An invalid route causes them to return NULL.) The two functions are very similar; they differ slightly in terms of how they clear the cached route if they find out that it is not valid anymore.

skb_set_owner_w

Assigns an sk_buff buffer to a given sock structure. This is useful for accounting.

sock_alloc_send_skbsock_wmalloc
sock_alloc_send_skb and sock_wmalloc

These functions allocate sk_buff buffers. sock_alloc_send_skb is called to allocate a single buffer or the first fragment of a series (see the discussion of ip_append_data); sock_wmalloc takes care of subsequent fragments. Both end up calling alloc_skb, but the first function is more complex and can fail for more reasons than the second. This is because if allocation of the first buffer succeeds, the following allocations have few reasons to fail.

Another data structure that appears in many of the functions in this chapter is the routing table cache entry associated with the packet, rtable. Many functions refer to it through a variable named rt. It contains information such as the outgoing device, the MTU of the outgoing device, and the next hop gateway. This structure is initialized by ip_route_output_flow and is described in Chapter 36.

The ip_queue_xmit Function

ip_queue_xmit is the function currently used by TCP and SCTP. It receives only two input parameters, and all the information needed to process the packet is accessible (directly or indirectly) through skb.

    int ip_queue_xmit(struct sk_buff *skb, int ipfragok)

Here is what the parameters mean:

skb

Buffer descriptors for the packet to transmit. This data structure has all the parameters needed to fill in the IP header and to transmit the packet (e.g., the next hop gateway). Remember that ip_queue_xmit is used to handle locally generated packets; forwarded packets do not have an associated socket.

ipfragok

A flag used mainly by SCTP to say whether fragmentation is allowed.

The socket associated with skb includes a pointer named opt that refers to a structure we saw in the section "Option Parsing" in Chapter 19. The latter structure contains the options in the IP header in a format that makes them easier for functions at the IP layer to access. This structure is kept in the socket structure because it is the same for every packet sent through that socket; it would be wasteful to rebuild the information for every packet.

        struct sock *sk = skb->sk;
        struct inet_sock *inet = inet_sk(sk);
        struct ip_options *opt = inet->opt;

Among the fields of the opt structure are offsets to the locations in the header where functions can store timestamps and IP addresses requested by IP options. Note that the structure does not cache the IP header itself, but only data that tells us what to write into the header, and where.

Setting the route

If the buffer is already assigned the proper routing information (skb->dst), there is no need to consult the routing table. This is possible under some conditions when the buffer is handled by the SCTP protocol:

        rt = (struct rtable *) skb->dst;
        if (rt != NULL)
            goto packet_routed;

In other cases, ip_queue_xmit checks whether a route is already cached in the socket structure and, if one is available, makes sure it is still valid (this is done by __sk_dst_check):

        rt = (struct rtable *)__sk_dst_check(sk, 0);

If the socket does not already have a route for the packet cached, or if the one the IP layer has been using so far has been invalidated in the meantime, such as by an update from a routing protocol, ip_queue_xmit needs to look for a new route with ip_route_output_flow and store the result in the sk data structure. The destination is represented by the daddr variable. First, this variable is set to the final destination of the packet (inet->daddr), which is the proper value if the IP header includes no Source Route option. However, ip_queue_xmit then checks for a Source Route option and, if one exists, sets the daddr variable to the next hop in the source route (inet->faddr). In case of a Strict Source Route option, the next hop found by ip_route_output_flow has to match exactly the next hop in the source route list.

        if (rt == NULL) {
            u32 daddr;

            daddr = inet->daddr;
            if(opt && opt->srr)
                daddr = opt->faddr;

            {
                struct flowi fl = { .oif = sk->sk_bound_dev_if,
                             .nl_u = { .ip4_u =
                                       { .daddr = daddr,
                                         .saddr = inet->saddr,
                                         .tos = RT_CONN_FLAGS(sk) } },
                            .proto = sk->sk_protocol,
                            .uli_u = { .ports =
                                       { .sport = inet->sport,
                                         .dport = inet->dport } } };

                if (ip_route_output_flow(&rt, &fl, sk, 0))
                    goto no_route;
            }
            __sk_dst_set(sk, &rt->u.dst);
            tcp_v4_setup_caps(sk, &rt->u.dst);
        }

Refer to Chapter 36 for details on the flowi data structure, and to Chapter 33 for details on the ip_route_output_flow routine.

The call to tcp_v4_setup_caps saves the features provided by the egress device in the socket sk; we can ignore this call during our discussion.

The packet is dropped if ip_route_output_flow fails. If the route is found, it is stored with __sk_dst_set in the sk data structure so that it can be used directly next time, and the routing table does not have to be consulted again. If for some reason the route is invalidated again, a future call to ip_queue_xmit will use ip_route_output_flow once more to find a new one.

As the following code shows, the packet is dropped if the IP header carries the Strict Source Routing option, and the next hop provided by that option does not match the next hop returned by the routing table:[*]

        skb->dst = dst_clone(&rt->u.dst);

    packet_routed:
        if (opt && opt->is_strictroute && rt->rt_dst != rt->rt_gateway)
            goto no_route;

dst_clone is called to increment the reference count on the data structure assigned to skb->dst.

When a packet is dropped, an error code is returned to the upper layer and the associated SNMP statistics are updated. Note that in this case the function does not need to send any ICMP to the source (we are the source).

Instead, if everything is OK, we have all the information needed to transmit the packet and it is time to build the IP header.

Building the IP header

So far, skb contains only the IP payload—generally the header and payload from the L4 layer, either TCP or SCTP. These protocols always allocate buffers whose size will be able to handle worst case scenarios with regards to the addition of the lower layer headers. In this way they reduce the chances that IP or any other lower layer will have to do memory copies or buffer reallocation to handle the addition of headers that do not fit the free space.

When ip_queue_xmit receives skb, skb->data points to the beginning of the L3 payload, which is where the L4 protocol writes its own data. The L3 header lies before this pointer. So skb_push is used here to move skb->data back so that it points to the beginning of the L3 or IP header; the result is illustrated in Figure 19-2 in Chapter 19. iph is also initialized to the pointer at that location.
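
The pointer motion skb_push performs can be pictured with a small userspace sketch (the struct and the _m-suffixed names are illustrative mocks, not the kernel definitions from include/linux/skbuff.h): the data pointer moves back into headroom reserved at allocation time, and the buffer's length grows accordingly.

```c
#include <stddef.h>

/* Mock of the two sk_buff fields skb_push touches (illustrative only). */
struct mock_buf {
    unsigned char *head;   /* start of the allocated buffer */
    unsigned char *data;   /* start of the data currently in the buffer */
    unsigned int   len;    /* amount of data in the buffer */
};

/* Sketch of skb_push: move data back by len and grow the buffer length. */
static unsigned char *skb_push_m(struct mock_buf *skb, unsigned int len)
{
    skb->data -= len;      /* assumes enough headroom was reserved */
    skb->len  += len;
    return skb->data;
}

/* Demo: with 32 bytes of headroom, pushing a 20-byte IP header leaves 12. */
static int skb_push_demo(void)
{
    static unsigned char pool[64];
    struct mock_buf skb = { .head = pool, .data = pool + 32, .len = 0 };

    skb_push_m(&skb, 20);
    return (int)(skb.data - skb.head);
}
```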

        iph = (struct iphdr *) skb_push(skb, sizeof(struct iphdr) +
                                        (opt ? opt->optlen : 0));

The next block initializes a bunch of fields in the IP header. The first assignment sets the value of three fields (version, ihl and tos) in one shot, because they share a common 16 bits. Thus, the statement sets the Version in the header to 4, the Header Length to 5, and the TOS to inet->tos.
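
As a sketch of why one store suffices (the helper name is illustrative, not kernel code): Version occupies the top 4 bits, IHL the next 4, and TOS the low byte of the header's first 16 bits, so a single htons'd value fills all three fields.

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Illustrative helper mirroring the combined assignment in ip_queue_xmit:
 * Version = 4, IHL = 5 (a 20-byte header with no options), TOS as given. */
static uint16_t ip_first_word(uint8_t tos)
{
    return htons((uint16_t)((4 << 12) | (5 << 8) | tos));
}
```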

Some of the values used to initialize the IP header are taken from sk and some others from rt, both of which were described earlier in the section "Relevant Socket Data Structures for Local Traffic."

        *((__u16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
        iph->tot_len = htons(skb->len);
        if (ip_dont_fragment(sk, &rt->u.dst) && !ipfragok)
            iph->frag_off = htons(IP_DF);
        else
            iph->frag_off = 0;
        iph->ttl      = ip_select_ttl(inet, &rt->u.dst);
        iph->protocol = sk->sk_protocol;
        iph->saddr    = rt->rt_src;
        iph->daddr    = rt->rt_dst;
        skb->nh.iph   = iph;

If the IP header contains options, the function needs to update the Header Length field iph->length, which was previously initialized to its default value, and then call ip_options_build to take care of the options. ip_options_build uses the opt variable, previously initialized to inet->opt, to add the required option fields (such as timestamps) to the IP header. Note that the last parameter to ip_options_build is set to zero, to specify that the header does not belong to a fragment (see the section "IP Options" in Chapter 19).
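
The >> 2 follows from the units involved: IHL counts 32-bit words, while opt->optlen is in bytes (always a multiple of 4). A one-line sketch of the arithmetic (the function name is illustrative):

```c
/* IHL in 32-bit words: the base header is 5 words (20 bytes), and an
 * option block of optlen bytes contributes optlen >> 2 additional words. */
static unsigned int ihl_with_options(unsigned int optlen)
{
    return 5 + (optlen >> 2);
}
```

With 40 bytes of options (the maximum), the IHL reaches its ceiling of 15.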

        if(opt && opt->optlen) {
            iph->ihl += opt->optlen >> 2;
            ip_options_build(skb, opt, inet->daddr, rt, 0);
        }

        mtu = dst_pmtu(&rt->u.dst);

Then ip_select_ident_more sets the IP ID in the header based on whether the packet is likely to be fragmented (see the section "Selecting the IP Header's ID Field" in Chapter 23), and ip_send_check computes the checksum on the IP header.
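
For reference, the checksum ip_send_check computes is the standard Internet checksum of RFC 1071. The kernel uses optimized, per-architecture implementations; the following is only a userspace sketch of the algorithm itself:

```c
#include <stddef.h>
#include <stdint.h>

/* Reference Internet checksum (RFC 1071): sum the header as 16-bit words,
 * fold the carries back into the low 16 bits, and take the complement. */
static uint16_t ip_checksum(const void *hdr, size_t len)
{
    const uint16_t *p = hdr;
    uint32_t sum = 0;

    while (len > 1) {              /* sum 16-bit words */
        sum += *p++;
        len -= 2;
    }
    if (len)                       /* odd trailing byte (never for IP headers) */
        sum += *(const uint8_t *)p;
    while (sum >> 16)              /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

A receiver verifies a header by summing it with the checksum field included; the result is zero for an intact header.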

skb->priority is used by Traffic Control to decide which one of the outgoing queues to enqueue the packet in; this in turn helps determine how soon it will be transmitted. The value in this function is taken from the sock structure, whereas in ip_forward (which manages nonlocal traffic and therefore does not have a local socket) its value is derived from a conversion table based on the IP TOS value (see the section "ip_forward Function" in Chapter 20).

        ip_select_ident_more(iph, &rt->u.dst, sk, skb_shinfo(skb)->tso_segs);
        ip_send_check(iph);
        skb->priority = sk->sk_priority;

Finally, Netfilter is called to see whether the packet has the right to jump to the following step (dst_output) and continue transmission:

        return NF_HOOK(PF_INET, NF_IP_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
                   dst_output);

The ip_append_data Function

This is the function used by those L4 protocols that want to buffer data for transmission. As stated earlier in this chapter, this function does not transmit data, but places it in conveniently sized buffers for later functions to form into fragments (if necessary) and transmit. Thus, it does not create or manipulate any IP header. To flush and transmit the data buffered by ip_append_data, the L4 layer has to explicitly call ip_push_pending_frames, which also takes care of the IP header.

If the L4 layer wants fast response time, it might call ip_push_pending_frames after each call to ip_append_data. But the two functions are provided so that the L4 layer can buffer as much data as possible (up to the size of the PMTU) and then send it at once to be efficient.

As one consequence of its role in preparing packets, ip_append_data buffers data only up to the maximum size of an IP packet. As explained in the section "Packet Fragmentation/Defragmentation" in Chapter 18, this is 64 KB.

The main tasks of ip_append_data are:

  • Organize the input data from the L4 layer into buffers whose size will make it easier to handle IP fragmentation if needed. This includes placing those data fragments into buffers in such a way that the L3 and L2 layers will find it easy to add the lower-layer headers later.

  • Optimize memory allocation, taking into account information from upper layers and the capabilities of the egress device. In particular:

    • If upper layers signal that more transmission requests will follow shortly (through the MSG_MORE flag), it could make sense to allocate a bigger buffer.

    • If the egress device supports Scatter/Gather I/O (NETIF_F_SG), fragments can be arranged in memory pages to optimize memory handling.

  • Take care of the L4 checksum. We saw in the section "net_device structure" in Chapter 19 how skb->ip_summed is initialized based on the egress device capabilities and other factors.

Given the more complex job of ip_append_data, compared to ip_queue_xmit, its more complex prototype should not come as a surprise:

    int ip_append_data(struct sock *sk,
                    int getfrag(void *from, char *to, int offset, int len,
                                int odd, struct sk_buff *skb),
                    void *from, int length, int transhdrlen,
                    struct ipcm_cookie *ipc, struct rtable *rt,
                    unsigned int flags)

Here is the meaning of the input parameters:

sk

Socket behind this packet's transmission. This data structure contains some of the parameters (such as the IP options) that will be needed later to fill in the IP header (by the ip_push_pending_frames function).

from

Pointer to the data (payload) the L4 layer is trying to transmit. This can be either a kernel or a user-space pointer, and it's the getfrag function's job (described next) to handle it correctly.

getfrag

Function used to copy the payload received from the L4 layer into the data fragments that will be created. More details can be found in the section "Copying data into the fragments: getfrag."

length

Amount of data to transmit (including both the L4 header and the L4 payload).

transhdrlen

Size of the transport (L4) header.

ipc

Information needed to forward the packet correctly. See the section "ipcm_cookie Structure" in Chapter 23.

rt

Routing table cache entry associated with the packet (described in Chapter 36). While ip_queue_xmit retrieves this information itself, ip_append_data relies on the caller to collect that information by means of ip_route_output_flow.

flags

This variable can contain any of the MSG_ XXX flags defined in include/linux/socket.h. Three of them are used by this function:

MSG_MORE

This flag is used by applications to tell the L4 layer that there will be more transmissions shortly. As we see here, this flag is propagated to the L3 layer. Later we will see how this information can be useful when allocating buffers.

MSG_DONTWAIT

When this flag is set, the call to ip_append_data must not block. ip_append_data may need to allocate a buffer (with sock_alloc_send_skb) for the socket sk. When the latter has already exhausted its budget, it can either block (with a timer) in the hope that some space will be made available before the timer expires, or fail. This flag can be used to choose between the two previous options.

MSG_PROBE

When this flag is set, the user does not really want to transmit anything; he is only probing the path. The flag can be used, for instance, to test a PMTU on a path toward a given IP address.[*] ip_append_data simply returns immediately with a successful return code if this flag is set.
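
The effect of MSG_MORE is observable from userspace. The following sketch (loopback traffic, error handling omitted for brevity; all calls are standard POSIX/Linux APIs) sends two chunks on a UDP socket, the first with MSG_MORE: ip_append_data buffers the first chunk, and the second send — without the flag — flushes both as one 12-byte datagram.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send "Hello, " with MSG_MORE, then "world" without it, over loopback.
 * Returns the size of the single datagram the receiver observes. */
static int udp_msg_more_demo(void)
{
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    char buf[64];
    ssize_t n;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(rx, (struct sockaddr *)&addr, sizeof(addr));
    getsockname(rx, (struct sockaddr *)&addr, &alen); /* learn the port */

    sendto(tx, "Hello, ", 7, MSG_MORE, (struct sockaddr *)&addr, sizeof(addr));
    sendto(tx, "world", 5, 0, (struct sockaddr *)&addr, sizeof(addr));

    n = recv(rx, buf, sizeof(buf), 0); /* both chunks arrive as one packet */
    close(rx);
    close(tx);
    return (int)n;
}
```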

ip_append_data is a long and complex function. Its numerous local variables, often with similar names, make it hard to follow. We will therefore break it down into its main steps. Given that there are many different combinations of possible outputs, based on the considerations listed near the beginning of this section, we will focus on a few common cases. By the end, you should be able to derive the other cases by yourself.

The next few sections describe what the outcome of ip_append_data should be. After that come several sections describing the initial tasks of the function, finishing with a description of its main loop.

The labels hh_len, exthdrlen, fraghdrlen, trailer_len, copy, and length in Figures 21-2 through 21-7 are either input parameters to ip_append_data or local variables used by ip_append_data (in particular, the value of copy shown in the figures is the one passed to getfrag). All of them are expressed in bytes. The labels X, Y, S, S1, and S2 represent the size of a data block expressed in bytes.

Basic memory allocation and buffer organization for ip_append_data

It is important to understand how the output from ip_append_data—the fragments to be turned into IP packets—is organized in memory. This section and the following two sections cover the data structures that organize the output data and how they are used. The same explanation applies to data formatted by the L4 layer and passed to ip_queue_xmit: this is done, for instance, by TCP instead of using ip_append_data. In every case, the buffers are eventually handed to dst_output, which appears near the center of Figure 18-1 in Chapter 18. Let's see a few examples.

ip_append_data can create one or more sk_buff instances, each representing a distinct IP packet (or IP fragment). This is true regardless of how the data is stored in sk_buff (i.e., regardless of whether it is fragmented).

Suppose we want to transmit an amount of data that lies within the PMTU (that is, it does not need to be fragmented). Let's also assume that because of the configuration of our host, we need to apply at least one of the protocols of the IPsec suite. Finally, let's suppose for the sake of simplicity that we are not trying to achieve memory optimizations in the way we allocate buffers. The results of ip_append_data (shown in Figure 21-2) in this case are as follows:

  • Because no fragmentation is needed, we allocate just one buffer.

  • The protocols of the IPsec suite may require both a header and a trailer, which wrap around the traditional buffer (including its traditional IP header). We need to take that into account both when allocating the buffer and when copying the data from the L4 layer into the buffer.

  • We also preallocate space for the header on the L2 layer. [*]

By reserving the space needed for all the protocols and layers that will come after the L4 layer, we eliminate the need for time-consuming memory manipulation later. Note also that the pointers to some of the headers (such as h.raw and nh.raw) are initialized; later the associated protocols can fill in their part. The only portion of the packet that is filled in by ip_append_data is the L4 payload. Other parts will be filled in as follows:

  • The L4 header will be filled in by the function that calls ip_push_pending_frames. That function can be invoked directly or via a wrapper (for example, UDP uses udp_push_pending_frames).

  • The L3 header (including the IP options) will be filled in by ip_push_pending_frames.

Figure 21-2. IP packet that does not need fragmentation, with IPsec

Part VI covers the L2 part of the header.

Now let's take a slightly more complex example that requires fragmentation. From the previous example, let's remove IPsec and increase the payload size so that it exceeds the PMTU. Figure 21-3 shows the output.[*]

Figure 21-3. Fragmentation without Scatter/Gather I/O, no MSG_MORE

The object on the bottom left is the buffer that ip_append_data receives as input, and length is another of ip_append_data's input parameters. Two buffers created by the function lie to the right. Note that the first contains a fragment that has the maximum size (PMTU), and the second contains leftover data. ip_append_data creates as many buffers as necessary based on the PMTU; it happens here that a second one holds all the remaining payload, and that it is smaller than the PMTU.
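
A back-of-the-envelope sketch of this sizing (the function name is illustrative; real IP fragments must additionally carry payloads in multiples of 8 bytes, which is ignored here): each buffer is filled up to the per-fragment payload capacity derived from the PMTU, and leftover data goes into one final, smaller buffer.

```c
/* Ceiling division: how many buffers ip_append_data needs for a payload
 * of the given length, with each buffer holding payload_per_frag bytes. */
static unsigned int fragments_needed(unsigned int length,
                                     unsigned int payload_per_frag)
{
    return (length + payload_per_frag - 1) / payload_per_frag;
}
```

With an Ethernet-like PMTU leaving 1480 bytes of payload per fragment, a 2000-byte payload yields two buffers, as in the figure.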

We said previously that ip_append_data will not transmit anything; it just creates buffers to be used later for packet fragments. This means that the L4 layer can potentially invoke ip_append_data again for either of the previous examples and add more data.

Let's take the second example and show what happens. Since the second buffer is full, we are forced to allocate a new buffer. This might end up with suboptimal fragmentation; it would be better to have every fragment except the last one fill up to the size of the PMTU.

One simple solution to achieve optimal fragmentation, at this point, is to allocate another buffer of maximum size, copy the data there from the second buffer, delete the second buffer, and merge the new data into the new buffer. If there is not enough space, we can allocate a third buffer. But this approach does not offer good performance. It vitiates the essential reason for doing data fragmentation before calling ip_fragment (shown in Figure 18-1 in Chapter 18), which is to avoid extra memory copies.

Now it should be clear why the MSG_MORE flag introduced in the section "The ip_append_data Function" can be useful. For example, if in the second example, we knew a second call would be coming, we would have allocated the second buffer with the maximum size directly, producing the output in Figure 21-4 (note that the size of the L2 header hh_len is not included in the PMTU).

If ip_append_data is called again before ip_push_pending_frames, it will first try to fill in the empty space in the second buffer in Figure 21-4 before allocating a third.

Figure 21-4. Fragmentation without Scatter/Gather I/O, MSG_MORE

Memory allocation and buffer organization for ip_append_data with Scatter/Gather I/O

Sometimes it is actually possible to add data to a fragment even if it has not been allocated with the maximum size. That is possible when the device supports Scatter/Gather I/O . This simply means that the L3 layer leaves data in the buffers where the L4 layer placed it, and lets the device combine those buffers to do the transmission. The advantage of Scatter/Gather I/O is that it reduces the overhead of allocating memory and copying data.

Consider this: an upper layer may generate many small items of data in successive operations and the L4 layer may store them in different buffers of kernel memory. The L3 layer is then asked to transmit all of these items in one IP packet. Without Scatter/Gather I/O, the L3 layer has to copy the data into new buffers to make a unified packet. If the device supports Scatter/Gather I/O, the data can stay right where it is until it leaves the host.

When Scatter/Gather I/O is in use, the memory area to which skb->data points is used only the first time. The following chunks of data are copied into pages of memory allocated specifically for this purpose. Figures 21-5 and 21-6 compare how the data received by ip_append_data in its second invocation is saved when Scatter/Gather I/O is enabled, versus when it is disabled:

  • Figure 21-5(a) shows memory use after the first call and Figure 21-5(b) shows it after the second call, when Scatter/Gather I/O is enabled. A buffer that uses frags is called a paged buffer. Note that the data fragment in Figure 21-5(b) does not need any header: remember that all data fragments of one sk_buff instance are associated with the same IP packet. This also implies that X+S1 is still smaller than the PMTU.

  • Figure 21-6(a) shows memory use after the first call and Figure 21-6(b) shows it after the second call, when Scatter/Gather I/O is disabled.

Some ancillary data structures support Scatter/Gather I/O. Each buffer except the first (which is allocated in the same way as when there is no support for Scatter/Gather I/O) is stored in skb_shinfo(skb)->frags. These can be found through pointers in the familiar sk_buff structure. As we saw in Chapter 2, each sk_buff structure includes a field of type skb_shared_info, which can be accessed with the macro skb_shinfo. This structure can be used to increase the size of the buffer by adding memory areas that can be located anywhere, not necessarily adjacent to one other. The nr_frags field helps the IP layer remember how many Scatter/Gather I/O buffers hang off of this packet. Note that this field counts Scatter/Gather I/O buffers—not IP fragments, as the name might suggest.

Now we can look at why the kernel needs special support on the device side to use this kind of buffer representation: to be able to refer to memory areas that are not contiguous but whose content is supposed to represent a contiguous data fragment, the device must be able to handle that kind of buffer representation. Note that Figure 21-7 shows the simple example where there is one page that contains two adjacent memory areas. But the fragments could easily be nonadjacent, either within a single page or on different pages.

Figure 21-5. ip_append_data with Scatter/Gather I/O

Each element of the frags array is represented by an skb_frag_t structure, which includes a pointer to a memory page, an offset relative to the beginning of the page, and the size of the fragment. Note that since the two fragments in Figure 21-7 are located within the same memory page, their page pointer points to the same memory page. The maximum number of fragments is MAX_SKB_FRAGS, which is defined based on the maximum size of an IP packet (64 KB) and the size of a memory page (which is defined on a per-architecture basis and whose default value on an i386 is 4 KB).
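
The arithmetic can be sketched as follows (the EXAMPLE_ macros are illustrative stand-ins, not the exact kernel definition): enough frag slots to cover a maximal 64 KB IP packet out of PAGE_SIZE pages, plus a little slack.

```c
/* With the i386 default page size of 4 KB, a maximal 64 KB IP packet
 * needs 16 full pages of fragments; the extra slots are slack. */
#define EXAMPLE_PAGE_SIZE     4096u
#define EXAMPLE_MAX_SKB_FRAGS (65536u / EXAMPLE_PAGE_SIZE + 2)
```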

Figure 21-6. ip_append_data without Scatter/Gather I/O

You can find the definitions of all the previously mentioned structures in include/linux/sk_buff.h.

Figure 21-7 shows the case where there is only one page, but since there could be several pages, the elements of the frags array include the page pointer to the proper page. A fragment cannot span two pages. When the size of a new fragment is bigger than the amount of free space in the current page, the fragment is split into two parts: one goes to the already existent page and fills it, and the second part goes into a new page.

Figure 21-7. Multiple fragments with Scatter/Gather I/O

One important detail to keep in mind is that Scatter/Gather I/O is independent from IP data fragmentation. Scatter/Gather I/O simply allows the code and hardware to work on nonadjacent memory areas as if they were adjacent. Nevertheless, each fragment must still respect the limit on its maximum size (the PMTU). This means that even if PAGE_SIZE is bigger than the PMTU, a new sk_buff will be created when the data in sk_buff (pointed to by skb->data) plus the ones referenced with frags reaches the PMTU.

Note also that the same page can hold fragments of data for different IP fragments, as shown in Figure 21-8. Each fragment of data added to the memory page increments the page's reference count. When the IP fragments are finally sent out and the data fragments in the page are released, the reference count is decreased accordingly and the memory page is released (see skb_release_data, which is called indirectly by kfree_skb).

The sock structure on the top left of Figure 21-8 includes both a pointer to the last page (sk_sndmsg_page) and an offset (sk_sndmsg_off) inside that page where the next data fragment should be placed.

Figure 21-8. Memory page shared between IP fragments

Key routines for handling fragmented buffers

To understand the functions described in this chapter and the ones in Chapter 22, you need to be familiar with the key buffer manipulation routines introduced in Chapter 2, and the following ones:

skb_is_nonlinear

Returns true when the buffer is fragmented (i.e., skb->data_len is non-null).

skb_headlen

Given a fragmented buffer, returns the amount of data in the main buffer (i.e., it does not account for the frags fragments nor does it take the frag_list list into account). Do not mistake skb_headlen for skb_headroom: the latter returns the free space between skb->head and skb->data.

skb_pagelen

Size of a fragmented buffer; it accounts for the data in the main buffer (skb_headlen) and the data in the frags fragments, but it does not consider any buffer linked to the frag_list list.
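These three helpers can be summarized with a small self-contained sketch. The struct below is a toy stand-in for sk_buff and its shared-info frags array, not the real kernel definitions; only the relationships between len, data_len, and the paged fragments matter here:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for sk_buff and its frags array: field names mirror the
 * kernel's, but the layout is heavily simplified for illustration. */
struct toy_skb {
    unsigned int len;          /* total data: main buffer plus frags */
    unsigned int data_len;     /* data stored in the frags pages only */
    unsigned int nr_frags;     /* number of entries used in frag_size */
    unsigned int frag_size[4]; /* size of each paged fragment */
};

/* True when the buffer is fragmented (data_len is nonzero). */
static int toy_skb_is_nonlinear(const struct toy_skb *skb)
{
    return skb->data_len != 0;
}

/* Data in the main buffer only; frags and frag_list are excluded. */
static unsigned int toy_skb_headlen(const struct toy_skb *skb)
{
    return skb->len - skb->data_len;
}

/* Main buffer plus the frags pages, but not the frag_list buffers. */
static unsigned int toy_skb_pagelen(const struct toy_skb *skb)
{
    unsigned int i, len = 0;

    for (i = 0; i < skb->nr_frags; i++)
        len += skb->frag_size[i];
    return len + toy_skb_headlen(skb);
}
```

For a buffer with 500 bytes in the main area and two paged fragments of 600 and 400 bytes, skb_headlen would report 500 and skb_pagelen 1500.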

Figure 21-9 shows a couple of examples. Note that skb->len includes the data fragments in frags (updated in ip_append_data) and in frag_list (updated in ip_push_pending_frames). I have omitted the details about the protocol headers because they are not necessary for our discussion.

Figure 21-9. Key functions for fragmented buffers: (a) Scatter/Gather; (b) no Scatter/Gather

I also would like to stress this point once more: the data in the frags vector is an extension of the data in the main buffer, whereas the data in frag_list represents independent buffers (i.e., each one will be transmitted independently as a separate IP fragment).

Further handling of the buffers

Whenever ip_append_data allocates a new sk_buff structure to handle a new data fragment (which will become a new IP fragment), it queues the fragment onto a queue called sk_write_queue that is associated with ip_append_data's input socket sk. This queue is the output of the function. Later functions need only add the IP headers to the data fragments and push them down to the L2 layer (to the dst_output routine, to be exact).

The sk_write_queue list is managed as a First In, First Out (FIFO) queue, as follows:

  • New elements (fragments) are added at the tail. It follows that the first element is the one that includes external headers such as IPsec (if any) and the L4 header (or part of it, if the PMTU is relatively small).

  • A new element is created and added to the list only when the size of the last fragment in sk_write_queue has reached the maximum size (maxfraglen). (The "size" here refers to the data being transmitted as part of that packet, which is the gray portions of Figure 21-5. It is not the size of the buffer, which might have been allocated to be larger than the available data to accommodate later data.) This is because ip_append_data never creates a fragment bigger than the PMTU associated with the route. When Scatter/Gather I/O is used, new chunks of data are stored in memory pages instead of the area pointed to by skb->data.
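The queueing discipline just described can be sketched with a toy singly linked FIFO. The types and names are illustrative, not the kernel's sk_buff_head API; they only mirror the two rules above:

```c
#include <assert.h>
#include <stddef.h>

/* Toy fragment queue mirroring how sk_write_queue is managed: new
 * fragments always go to the tail, so the head always remains the
 * fragment carrying the L4 (and any external) headers. */
struct toy_frag {
    struct toy_frag *next;
    unsigned int len;          /* data carried by this fragment */
};

struct toy_queue {
    struct toy_frag *head;
    struct toy_frag *tail;
};

static void toy_queue_tail(struct toy_queue *q, struct toy_frag *f)
{
    f->next = NULL;
    if (q->tail == NULL)
        q->head = f;           /* first fragment of the packet */
    else
        q->tail->next = f;
    q->tail = f;
}

/* A new queue element is needed only when the queue is empty or the
 * last fragment has reached the maximum fragment size. */
static int toy_need_new_fragment(const struct toy_queue *q,
                                 unsigned int maxfraglen)
{
    return q->tail == NULL || q->tail->len >= maxfraglen;
}
```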

Now that we know what kind of output ip_append_data produces, we can look at the code. Once again, keep in mind that the L4 layer can call ip_append_data several times before flushing the buffers with ip_push_pending_frames.

Let's suppose that UDP issued three calls to ip_append_data with the following payload sizes: 300, 250, and 200 bytes. Let's also assume the PMTU is 500 bytes. It should be clear that if UDP had sent a single payload of 750 bytes, the IP layer would have created a first fragment of 500 bytes and a second one of 250 bytes.[*] However, the application using that UDP socket might actually want to send three distinct IP packets of sizes 300, 250, and 200 bytes. ip_append_data can be told which way to behave. If the application behind the UDP socket prefers to obtain higher throughput, it uses the MSG_MORE flag to tell ip_append_data to create maximum-size fragments (500 bytes) and the result would be a first fragment of 500 bytes and a second one of 250 bytes. If it does not signal the preference for such buffering, UDP transmits each payload individually (see the section "Putting Together the Transmission Functions").
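From user space, the choice between the two behaviors comes down to the MSG_MORE flag on sendto. A minimal sketch of the 300/250/200-byte scenario follows; the destination address and port are placeholders (port 9 is the discard service), and the function name is made up:

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send three payloads (300, 250, and 200 bytes) over one UDP socket.
 * With MSG_MORE set on the first two calls, the kernel may buffer the
 * data via ip_append_data and emit maximum-size fragments; the final
 * call without the flag flushes the pending datagram. */
int send_coalesced(const char *ip, int port)
{
    struct sockaddr_in dst;
    char buf[300];
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    inet_pton(AF_INET, ip, &dst.sin_addr);
    memset(buf, 'x', sizeof(buf));

    if (sendto(fd, buf, 300, MSG_MORE, (struct sockaddr *)&dst, sizeof(dst)) < 0 ||
        sendto(fd, buf, 250, MSG_MORE, (struct sockaddr *)&dst, sizeof(dst)) < 0 ||
        sendto(fd, buf, 200, 0, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

With a PMTU of 500 bytes, the coalesced 750-byte datagram would go out as a 500-byte fragment followed by a 250-byte one, as described above.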

Setting the context

The first block of the ip_append_data function initializes some local variables and possibly changes some of the input parameters. The exact work done depends on whether the function is creating the first IP fragment (in which case the sk_write_queue queue would be empty) or a later one within a packet. With the first element, ip_append_data initializes inet->cork and inet with fields that will be used by the following invocation of ip_append_data (and by ip_push_pending_frames).

Among the information saved is the IP options and the routing table cache entry. Caching them saves time during subsequent calls to ip_append_data for the same packet, but is not strictly necessary because ip_append_data's caller will pass the data again in all of the following calls.

        if (skb_queue_empty(&sk->sk_write_queue)) {
            opt = ipc->opt;
            if (opt) {
                if (inet->cork.opt == NULL) {
                    inet->cork.opt = kmalloc(sizeof(struct ip_options) + 40,
                                             sk->sk_allocation);
                               if (unlikely(inet->cork.opt == NULL))
                                        return -ENOBUFS;
                }
                memcpy(inet->cork.opt, opt,
                       sizeof(struct ip_options)+opt->optlen);
                inet->cork.flags |= IPCORK_OPT;
                inet->cork.addr = ipc->addr;
            }
            dst_hold(&rt->u.dst);
            inet->cork.fragsize = mtu = dst_pmtu(&rt->u.dst);
            inet->cork.rt = rt;
            inet->cork.length = 0;
            sk->sk_sndmsg_page = NULL;
            sk->sk_sndmsg_off = 0;
            if ((exthdrlen = rt->u.dst.header_len) != 0) {
                length += exthdrlen;
                transhdrlen += exthdrlen;
            }
        } else {
            rt = inet->cork.rt;
            if (inet->cork.flags & IPCORK_OPT)
                opt = inet->cork.opt;
            transhdrlen = 0;
            exthdrlen = 0;
            mtu = inet->cork.fragsize;
        }

To understand the rest of the function, you need to understand the meaning of the following key variables. Some of them are received in input by ip_append_data; refer to the section "The ip_append_data Function" for their descriptions. It can also be useful to refer back to Figures 21-2 through 21-8.

rt

Routing table cache entry used to transmit the IP datagram. This structure includes several fields, such as the next hop gateway, the egress device, and the PMTU.

mtu

The PMTU associated with rt.

opt

IP options to add to the IP header. When this variable is NULL, there are no options.

exthdrlen (external header len)

transhdrlen (transport header len)

When the L4 layer invokes ip_append_data it passes these two parameters because they need to be taken into account when allocating buffers. transhdrlen is passed directly; exthdrlen is retrieved indirectly via rt. Examples of external headers are the ones used by the protocols in the IPsec suite, such as the Authentication Header (AH) and the Encapsulation Security Payload (ESP). Examples of transport headers are those of the common TCP, UDP, and ICMP protocols.

The way length, exthdrlen, and transhdrlen are initialized may be confusing. I'll explain why their values are changed under some conditions.

As we have already seen, only the first fragment needs to include the transport header and the optional external headers. Because of this, transhdrlen and exthdrlen are zeroed after creating the first fragment. As we will see, this can be done right at the beginning of the function if sk_write_queue is not empty, or inside the big while loop before starting a second iteration.

Because of this initialization, the value of transhdrlen is used by the function to distinguish between the first fragment and the following ones:

  • transhdrlen ! = 0 means ip_append_data is working on the first fragment.

  • transdhrlen = 0 means ip_append_data is not working on the first fragment.

The same logic cannot be applied to exthdrlen, because the L4 header is needed for every IP packet, but many have no external headers because they don't use special features such as IPsec.

The variables initialized here have several important uses later:

  • When deciding how much data to copy into each data fragment, the function needs to take into account that the first fragment includes the L4 header and optional external headers, and therefore that less space is available for the payload (see Figure 21-2).

  • When deciding how big to allocate the buffers, the function needs to take into account the extra space needed by the external headers (if any).

  • When initializing the nh.raw and h.raw pointers, the function needs to know whether there are external headers and where they are located to correctly compute the offsets within the packet.

Getting ready for fragment generation

As we will see later, the amount of data copied into each generated fragment may change from fragment to fragment. However, each fragment always includes a fixed portion for the L2 and L3 headers. Figures 21-2 through 21-8 all show this reserved portion.

Before proceeding, the function defines the following three local variables:

        hh_len = LL_RESERVED_SPACE(rt->u.dst.dev);
        fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
        maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;

hh_len is the length of the L2 header. When reserving space for all the headers that precede IP in the buffer, ip_append_data needs to know how much space is needed by the L2 header. This way, when the device driver initializes its header, it will not need to reallocate space or move data inside the buffer to make space for the L2 header.

fragheaderlen is the size of the IP header, including the IP options, and maxfraglen is the maximum size of an IP fragment based on the route PMTU.
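The effect of the & ~7 in the maxfraglen computation can be verified with a small standalone sketch; it assumes a 20-byte IP header, and the helper name is made up:

```c
#include <assert.h>

/* Sketch of the maxfraglen computation: the payload room left by the
 * IP header is rounded down to a multiple of eight bytes, as required
 * for every IP fragment except the last one. */
static unsigned int toy_max_frag_len(unsigned int mtu, unsigned int optlen)
{
    unsigned int fragheaderlen = 20 + optlen; /* sizeof(struct iphdr) is 20 */

    return ((mtu - fragheaderlen) & ~7u) + fragheaderlen;
}
```

With an Ethernet PMTU of 1500 and no options, the 1480-byte payload room is already a multiple of 8, so maxfraglen is the full 1500; with a PMTU of 1006, the 986-byte room rounds down to 984 and maxfraglen is 1004.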

As explained in the section "Packet Fragmentation/Defragmentation" in Chapter 18, the maximum size of an IP packet (header plus payload) is 64 KB. This applies not just to individual fragments, but also to the complete packet into which those fragments will be reassembled at the end. Thus, ip_append_data keeps track of all the data received for a particular packet and refuses to go over the 64 KB (0xFFFF) limit.

        if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
            ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu-exthdrlen);
            return -EMSGSIZE;
        }
        inet->cork.length += length;

The last initialization is the checksum mode, the value of which is saved in skb->ip_summed. See the section "L4 checksum."

Copying data into the fragments: getfrag

ip_append_data can potentially be used by any L4 protocol. One of its tasks is to copy the input data into the fragments it creates. Different protocols may need to apply different operations to the data copied. One example of such a specialized operation is the computation of the L4 checksum, which is not compulsory for some L4 protocols. Another distinguishing factor could be the origin of the data. This is user space for locally generated packets, and kernel space for forwarded packets or packets generated by the kernel (e.g., ICMP messages).

Instead of having one shared function that takes care of all the possible combinations of protocols and optional operations to apply, it is easier and cleaner to have multiple small functions tailored to each protocol's need. To keep ip_append_data as generic as possible, it allows each protocol to specify the function to use to copy the data by means of the input parameter getfrag. In other words, ip_append_data uses getfrag to copy the input data into the buffers; the result of this copying consists of the memory areas labeled "L4 payload" in Figures 21-2 through 21-9.

Table 21-1 lists the functions used by the most common L4 protocols that invoke ip_append_data. Another function, ip_reply_glue_bits, is used by ip_send_reply (see the section "Key Functions That Perform Transmission").

getfrag receives four input parameters (from, to, offset, and len), and simply copies len bytes from from to to+offset, taking into account that from could be a pointer into user-space memory and thus has to be handled accordingly (it may require translation from user to kernel memory). It also takes care of the L4 checksum: while copying data into the kernel buffer, it updates skb->csum according to the skb->ip_summed configuration.
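As a rough illustration of this contract, here is a hypothetical kernel-space copier with the same overall shape as a getfrag routine. The byte-wise sum is a toy stand-in for the kernel's Internet-checksum helpers, and the name is made up:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical getfrag-style callback for data already in kernel
 * memory: copy len bytes of the source payload, starting at offset,
 * into the destination area, and keep a running checksum over the
 * copied bytes (a plain byte sum here; the real code uses csum_partial
 * and friends). Returns 0 on success, as the real routines do. */
static int toy_getfrag(void *from, char *to, int offset, int len,
                       unsigned int *csum)
{
    const unsigned char *src = (const unsigned char *)from + offset;
    int i;

    memcpy(to, src, (size_t)len);
    for (i = 0; i < len; i++)
        *csum += src[i];
    return 0;
}
```

A user-space variant would differ only in how it reads the source (via a copy-from-user primitive instead of memcpy), which is exactly why the callback is pluggable.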

Table 21-1. getfrag routines

Protocol

API

ICMP

icmp_glue_bits

UDP

ip_generic_getfrag

RAW IP

ip_generic_getfrag

TCP (via ip_send_reply)

ip_reply_glue_bits

In a situation where the origin of the getfrag function's input—user space versus kernel—is always the same, the function does not need to distinguish between the two cases. For example:

  • icmp_glue_bits is used by the ICMP protocol when transmitting a message. Because the ICMP message is either built by the kernel or derived from another ICMP message previously received (which therefore is in kernel memory), icmp_glue_bits knows the data is in kernel space.

  • When an application issues a sendmsg system call on a UDP or raw IP socket, the kernel ends up calling ip_append_data, passing ip_generic_getfrag as the getfrag function. In this case, the input data is known to always come from user space.

Let's take a closer look at the generic function ip_generic_getfrag:

    int
    ip_generic_getfrag(void *from, char *to, int offset, int len, int odd,
                       struct sk_buff *skb)
    {
        struct iovec *iov = from;

        if (skb->ip_summed == CHECKSUM_HW) {
            if (memcpy_fromiovecend(to, iov, offset, len) < 0)
                return -EFAULT;
        } else {
            unsigned int csum = 0;
            if (csum_partial_copy_fromiovecend(to, iov, offset, len, &csum) < 0)
                return -EFAULT;
            skb->csum = csum_block_add(skb->csum, csum, odd);
        }
        return 0;
    }

The section "sk_buff structure" in Chapter 19 explained the meaning of CHECKSUM_HW, and how skb->csum and skb->ip_summed are used. In the section "L4 checksum," we will see how ip_append_data decides whether the L4 checksum should be computed in hardware or software (or not computed at all). In the previous snapshot, you can see that ip_generic_getfrag uses two different functions to copy the data (memcpy_fromiovecend and csum_partial_copy_fromiovecend), based on whether the L4 checksum is going to be computed in hardware or must be computed in software.

Buffer allocation

ip_append_data chooses the size of the buffers to allocate based on:

Single transmission versus multiple transmissions

If ip_append_data is told there will be other transmission requests soon after (if MSG_MORE is set), it could make sense to allocate a bigger buffer so that data from future transmissions can be merged into the same buffer. See the earlier section "Basic memory allocation and buffer organization for ip_append_data" for further explanation.

Scatter/Gather I/O

If the device can handle Scatter/Gather I/O, fragments could be more efficiently stored into memory pages. See the earlier section "Memory allocation and buffer organization for ip_append_data with Scatter Gather I/O" for further explanation.

The following piece of code decides the size of the buffer to allocate (alloclen) based on the two points just stated. The buffer is created with the maximum size (based on the PMTU) if more data is expected and if the device can't handle Scatter/Gather I/O. If either of those conditions is not true, the buffer is made just large enough to hold the current data.

                if ((flags & MSG_MORE) &&
                    !(rt->u.dst.dev->features&NETIF_F_SG))
                    alloclen = mtu;
                else
                    alloclen = datalen + fragheaderlen;

                if (datalen == length)
                        alloclen += rt->u.dst.trailer_len;

Note that when ip_append_data generates the last fragment, it needs to take into account the presence of trailers (such as for IPsec).

datalen is the amount of data to be copied into the buffer we are allocating. Its value was previously initialized based on three factors: the amount of data left (length), the maximum amount of data that fits into a fragment (mtu - fragheaderlen), and an optional carry from the previous buffer (fraggap).

The last component, fraggap, requires an explanation. With the exception of the last buffer (which holds the last IP fragment), all fragments must respect the rule that the size of the payload of an IP fragment must be a multiple of eight bytes. For this reason, when the kernel allocates a new buffer that is not for the last fragment, it may need to move a piece of data (whose size ranges from 0 to 7 bytes) from the tail of the previous buffer to the head of the newly allocated one. In other words, fraggap is zero unless all of the following are true:

  • The PMTU is not a multiple of eight bytes.

  • The size of the current IP fragment has not reached the PMTU yet.

  • The size of the current IP fragment has passed the highest multiple of eight bytes that is less than or equal to the PMTU.
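When all three conditions hold, the carry is simply whatever the previous fragment holds beyond maxfraglen. A minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>

/* Sketch: the number of trailing bytes that must migrate from the
 * previous fragment to the new one so that the previous payload stays
 * a multiple of eight bytes. The carry is whatever the previous
 * fragment holds beyond maxfraglen; zero when no move is needed. */
static unsigned int toy_fraggap(unsigned int prev_len, unsigned int maxfraglen)
{
    return prev_len > maxfraglen ? prev_len - maxfraglen : 0;
}
```

For example, with a PMTU of 1006 and a 20-byte header, maxfraglen is 1004, so a previous fragment filled to the full 1006 bytes carries 2 bytes into the new one.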

Figure 21-10 shows an example where fraggap is nonzero and alloclen has been initialized to mtu. Note that when the kernel moves the data from the current buffer, skb_prev, to the new one, skb, it also needs to adjust the L4 checksums on both skb_prev and skb (see the section "L4 checksum"). The figure shows the buffers as two flat memory areas for simplicity, but both could be paged (as in Figure 21-5) or nonpaged (as in Figure 21-6): the function used to move the fraggap area, skb_copy_and_csum_bits, can handle both formats. The same function also updates the L4 checksums.

Figure 21-10. Respecting the 8-byte boundary rule on IP fragments

Main loop

The while loop that potentially creates extra buffers may look more complex than it actually is. Figure 21-11 summarizes its job.

Figure 21-11. ip_append_data function: main loop

Initially, the value of length represents the amount of data that the ip_append_data's caller wants to transmit. However, once the loop is entered, its value represents the amount of data left to handle. This explains why its value is updated at the end of each loop and why ip_append_data loops until length becomes zero.

We already know that MSG_MORE indicates whether the L4 layer expects more data, and that NETIF_F_SG indicates whether the device supports Scatter/Gather I/O. These settings have no effect on the first task within the loop, which is to allocate and initialize sk_buff structures within the first if block inside the loop. Also, the first data fragment is always copied into the sk_buff area (see Figure 21-5(a) and Figure 21-6(a)).

ip_append_data allocates a new sk_buff structure and queues it to sk_write_queue every time one of the following occurs:

  • sk_write_queue is empty (that is, for the first fragment).

  • The last element of sk_write_queue has been filled in completely.

The piece of code that precedes the loop takes care of the first case by forcing allocation when the queue is empty:

        if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL)
            goto alloc_new_skb;

The first part inside the loop handles the second case. First it initializes copy to the amount of space that is left in the current IP fragment: mtu - skb->len. If the data left to add (length) is greater than the amount of free space, copy, there is a need for one more IP fragment. In that case, copy is updated. To enforce the 8-byte boundary rule, copy is lowered to the closest 8-byte boundary. At this point, the kernel can decide whether it needs to allocate a new buffer (i.e., a new IP fragment). This is the logic associated with the if condition that compares copy against 0:

copy > 0

This means that skb (the last element of sk_write_queue) has some space available. ip_append_data first uses that space. If the space left was not sufficient (i.e., length is greater than the space available), the loop will iterate again, and this time it will fall into the next category (see Figures 21-3 and 21-4).

copy = 0

This means that it is time to allocate a new sk_buff because the last one has been filled in completely. In this case, the code inside the if block allocates a new buffer, copies the input data into the buffer, and queues the new fragment to sk_write_queue. Subsequent fragments will either be merged with the previous one or copied into memory pages allocated specifically for Scatter/Gather I/O.

copy < 0

This is a special case of the previous one. When copy is negative, it means that some data must be deleted from the current IP fragment and moved to the new one. See the earlier section "Buffer allocation" for more details.
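The three cases can be condensed into a sketch of the per-iteration computation of copy, simplified from the kernel logic; the toy_ prefix marks it as illustrative:

```c
#include <assert.h>

/* Simplified per-iteration computation of copy: start from the room
 * left in the current fragment (mtu - skb_len); if that is not enough
 * for the remaining data, fall back to the 8-byte-aligned limit, which
 * can be zero or negative and thus trigger a new allocation. */
static int toy_compute_copy(unsigned int mtu, unsigned int maxfraglen,
                            unsigned int skb_len, unsigned int length)
{
    int copy = (int)(mtu - skb_len);

    if (copy < (int)length)
        copy = (int)(maxfraglen - skb_len);
    return copy; /* > 0: append here; <= 0: allocate a new fragment */
}
```

With mtu 1006 and maxfraglen 1004: a fragment holding 800 bytes can still take data; one holding 1004 forces a fresh fragment; one holding 1005 additionally forces a byte to move back to the new fragment (the fraggap case).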

Every time a new loop ends, the function needs to advance the pointer to the data to copy (offset) and to update the amount of data left to copy (length). Once the fragment has been queued with __skb_queue_tail, the function may need to restart the loop if any data is left.

L4 checksum

We saw in the section "net_device structure" in Chapter 19 that the L3 and L4 checksums can be computed by the egress NIC when its device driver advertises that capability by setting the right flags in dev->features. In particular, skb->ip_summed (and eventually skb->csum) must be initialized to show whether the egress device provides support for L4 hardware checksumming. Refer to the aforementioned section for more details.

Whether hardware checksumming can be used is decided when ip_append_data is called for the first fragment (i.e., transhdrlen is nonzero). Hardware checksumming is applicable only when all of the following conditions are met:

  • The IP packet built by ip_append_data is not going to be fragmented (i.e., the total data fed to ip_append_data does not exceed the PMTU).

  • The egress device supports hardware checksumming.

  • There are no transformation headers (i.e., protocols of the IPsec suite). Such transformations can, for example, compress or encrypt the data the NIC is supposed to read when computing the checksum. These transformations also insert additional headers between the IP header and the L4 header. This means that L4 hardware checksumming and IPsec transformations cannot coexist.

Hardware checksumming might also have to be turned off under other conditions.

The first bullet in the previous list requires an explanation. Hardware checksumming does not work when the IP packet is fragmented (as in the example in Figure 21-3). However, because ip_append_data can be called several times before the actual transmission takes place (i.e., before ip_push_pending_frames is called), the IP layer may not know that fragmentation is required when ip_append_data is first called, and therefore the initial decision is based only on the input data (length): if fragmentation is required based on length, hardware checksumming is not used.

        if (transhdrlen &&
            length + fragheaderlen <= mtu &&
            rt->u.dst.dev->features&(NETIF_F_IP_CSUM|NETIF_F_NO_CSUM|NETIF_F_HW_CSUM) &&
            !exthdrlen)
                 csummode = CHECKSUM_HW;

The local variable csummode initialized here will be assigned to skb->ip_summed on the first buffer. If there is a need for fragmentation and ip_append_data allocates more buffers accordingly (one for each IP fragment), skb->ip_summed on the subsequent buffers will be set to CHECKSUM_NONE. When getfrag is called to copy the data into the buffers, it also takes care of the L4 checksum if it is passed a buffer with skb->ip_summed initialized to CHECKSUM_NONE (see the section "Copying data into the fragments: getfrag").

Note that ip_append_data checksums only the L4 payloads. In the section "Changes to the L4 Checksum" in Chapter 18, we saw that the L4 checksum must include the L4 header as well as the so-called pseudoheader. If ip_push_pending_frames is called by the L4 layer when sk_write_queue has only one IP fragment and the egress device supports hardware checksumming, the L4 protocol only needs to initialize skb->csum to the right offset and the L4 header's checksum field with the pseudoheader checksum, as we saw in the section "sk_buff structure" in Chapter 19. If instead the egress device does not support hardware checksumming, or the latter is supported but cannot be used because sk_write_queue has more than one IP fragment, the L4 checksum must be computed in software. In this case, getfrag computes the partial checksums on the L4 payloads while copying data into the buffers, and the L4 protocol will combine them later to get the value to put into the L4 header. See the section "Putting Together the Transmission Functions" to see how UDP takes care of the L4 checksum before invoking ip_push_pending_frames.
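
The reason the partial checksums computed by getfrag can be combined at all is that the Internet checksum is a 16-bit one's-complement sum, which is associative. A minimal userland sketch of that arithmetic (modeling what csum_partial and csum_block_add do, with invented function names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userland sketch of the 16-bit one's-complement arithmetic behind
 * csum_partial and csum_block_add. Because one's-complement addition
 * is associative, the partial sums computed over each fragment's
 * payload can be combined later, and the pseudoheader is folded in
 * the same way before the final value goes into the L4 header.
 * (Fragments are assumed to start at even offsets; combining at odd
 * offsets would additionally require a byte swap.) */
static uint32_t csum_add32(uint32_t a, uint32_t b)
{
    uint32_t s = a + b;
    if (s < a)                  /* wrap the carry back in */
        s++;
    return s;
}

static uint32_t csum_partial_buf(const uint8_t *buf, size_t len, uint32_t sum)
{
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        sum = csum_add32(sum, (uint32_t)((buf[i] << 8) | buf[i + 1]));
    if (len & 1)
        sum = csum_add32(sum, (uint32_t)(buf[len - 1] << 8));
    return sum;
}

static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

static bool csum_demo(void)
{
    static const uint8_t d[6] = { 1, 2, 3, 4, 5, 6 };
    /* Checksumming the whole buffer in one pass... */
    uint32_t whole = csum_partial_buf(d, 6, 0);
    /* ...gives the same result as combining two partial sums. */
    uint32_t split = csum_partial_buf(d + 2, 4, csum_partial_buf(d, 2, 0));
    return csum_fold(whole) == csum_fold(split);
}
```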

For an example of how a device driver instructs the NIC to compute the L4 hardware checksum when required, see the boomerang_start_xmit routine in drivers/net/3c59x.c and cp_start_xmit in drivers/net/8139cp.c. In both cases, you can also see how a paged skb is handled when setting up the DMA transfers.

The ip_append_page Function

We saw in the section "Copying data into the fragments: getfrag" that a transmission request from user space, with a call like sendmsg, requires a copy to move the data to transmit from user space to kernel space. This copy is made by the getfrag function passed as an input parameter to ip_append_data.

The kernel provides user-space applications with another interface, sendfile, which allows applications to optimize the transmission and avoid the data copy. This interface has been widely publicized as "zero-copy" TCP/UDP.

The sendfile interface can be used only when the egress device supports Scatter/Gather I/O . In this case, the logic implemented by ip_append_data can be simplified so that no copy is necessary (i.e., the data the user asked to transmit is left where it is). The kernel just initializes the frag vector with the location of the data buffer received in input, and takes care of the L4 checksum if needed. This simplified logic is what is provided by ip_append_page. While ip_append_data receives the location of the data with a void* pointer, ip_append_page receives the location as a pointer to a memory page and offset within it, which makes it straightforward to initialize one entry of frag.

The only piece of code that differs from ip_append_data with regard to Scatter/Gather I/O is the following:

            i = skb_shinfo(skb)->nr_frags;
            if (len > size)
                len = size;
            if (skb_can_coalesce(skb, i, page, offset)) {
                skb_shinfo(skb)->frags[i-1].size += len;
            } else if (i < MAX_SKB_FRAGS) {
                get_page(page);
                skb_fill_page_desc(skb, i, page, offset, len);
            } else {
                err = -EMSGSIZE;
                goto error;
            }

            if (skb->ip_summed == CHECKSUM_NONE) {
                unsigned int csum;
                csum = csum_page(page, offset, len);
                skb->csum = csum_block_add(skb->csum, csum, skb->len);
            }

When adding a new fragment to a page, ip_append_page first tries to merge the new one with the previous fragment already in the page. To do that, it checks, by means of skb_can_coalesce, whether the point where the new one should be added matches the point where the last one ends. If merging is possible, all it has to do is update the length of the previous fragment already in the page to include the new data.

When merging is not possible, the function initializes the new fragment with skb_fill_page_desc. In this case, it also increments the reference count on the page with get_page. The reference count must be incremented because ip_append_page uses the page it receives as input, and this page could potentially be used by someone else, too.
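
The coalescing test itself is simple. The following is a minimal userland model of the check skb_can_coalesce performs; struct frag_desc and the function names are invented for illustration, not kernel types:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal model (invented struct) of the test skb_can_coalesce
 * performs: a new chunk can be merged into the last frag descriptor
 * only if it lives in the same page and starts exactly where the
 * previous chunk ends. */
struct frag_desc {
    const void *page;
    size_t page_offset;
    size_t size;
};

static bool can_coalesce(const struct frag_desc *last,
                         const void *page, size_t offset)
{
    return last->page == page &&
           last->page_offset + last->size == offset;
}

static bool coalesce_demo(void)
{
    static char page[4096];
    struct frag_desc last = { page, 100, 50 };

    /* Contiguous data in the same page merges; a gap does not. */
    return can_coalesce(&last, page, 150) &&
           !can_coalesce(&last, page, 200);
}
```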

ip_append_page is currently used by UDP only. We said that TCP does not use the ip_append_data and ip_push_pending_frames functions because it implements the same logic in tcp_sendmsg. The same applies to this zero-copy interface: TCP does not use ip_append_page, but implements the same logic in do_tcp_sendpage. Unlike UDP, TCP allows the application to use the zero-copy interface only if the egress device supports L4 hardware checksumming.[*]

The ip_push_pending_frames Function

As explained near the beginning of this chapter, ip_push_pending_frames works in tandem with ip_append_data and ip_append_page. When the L4 layer decides it is time to wrap up and transmit the fragments queued to sk_write_queue through ip_append_data or ip_append_page (either because of some protocol-specific criterion or because it is explicitly told by the higher-level application to send the data), it simply calls ip_push_pending_frames:

    int ip_push_pending_frames(struct sock *sk)

The function receives a sock structure in input. It needs access to several fields, notably the pointer to the socket's sk_write_queue structure.

We saw in the section "Memory allocation and buffer organization for ip_append_data with Scatter Gather I/O" that the data in the packet is organized differently in the sk_buff structure, depending on whether Scatter/Gather I/O is used.

The code in this half queues all the buffers that follow the first one into a list named frag_list that is part of the first element, as shown in Figure 21-12, and updates the len and data_len fields of the buffer at the head of the list to account for all of the fragments. This last operation is performed because it is useful to the ip_fragment routine that comes later in the code path (see Figure 18-1 in Chapter 18, and see Chapter 22). As buffers are queued onto frag_list, they are cleared off of sk_write_queue. It requires very little time to create the new list (no data is copied; only pointers are changed) and the result is to free the sk_write_queue list, which therefore allows the L4 layer to consider the data transmitted. The data is now out of the hands of the L4 layer and completely under the care of the IP layer.
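
The pointer shuffling described above can be modeled in userland with a toy buffer type. This is an invented sketch (struct buf and collapse_queue are not kernel names) of the queue collapse, showing why no data copy is needed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented userland model of the queue collapse described above:
 * the first buffer on sk_write_queue becomes the packet head, every
 * following buffer is chained onto its frag_list, and the head's
 * len/data_len fields are updated to account for all fragments. */
struct buf {
    struct buf *next;
    struct buf *frag_list;
    int len;       /* total data this buffer accounts for */
    int data_len;  /* data not in the linear area         */
};

static struct buf *collapse_queue(struct buf *queue)
{
    struct buf *head = queue, *p;

    head->frag_list = head->next;   /* pointers move; no data is copied */
    head->next = NULL;
    for (p = head->frag_list; p; p = p->next) {
        head->len += p->len;
        head->data_len += p->len;
    }
    return head;
}

static bool collapse_demo(void)
{
    /* Three queued fragments: 1480 + 1480 + 300 bytes of payload. */
    struct buf c = { NULL, NULL, 300, 0 };
    struct buf b = { &c, NULL, 1480, 0 };
    struct buf a = { &b, NULL, 1480, 0 };
    struct buf *h = collapse_queue(&a);

    return h->len == 3260 && h->data_len == 1780 && h->frag_list == &b;
}
```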

Remember, as you look at Figure 21-12, that nr_frags reflects the number of Scatter/Gather I/O buffers, and not the number of IP fragments. Two points are worth mentioning about Figure 21-12:

  • The input to ip_push_pending_frames shown in the example in Figure 21-12(a) reflects the no Scatter/Gather case (i.e., no use of the frags vector). With Scatter/Gather, you would have a list of buffers like the one in Figure 21-7.

  • The skb_shinfo block is shown only on the buffer in Figure 21-12(b) that uses it, but it is there for all the other sk_buff structures, too.

After that, it is time to fill in the IP header. If there are multiple fragments, only the first is going to have its IP header filled in by ip_push_pending_frames; the others will be taken care of later (we will see how in Chapter 22).

The setting of the TTL field of the IP header (iph->ttl) depends on whether the destination address is multicast. Usually, a smaller value is used for multicast traffic because multicasting is most often used to deliver streaming (and sometimes interactive) data such as audio and video that can become useless if it is received too late. The default values assigned to the TTL field for multicast and unicast packets are 1 and 64, respectively.[*]

        if (rt->rt_type == RTN_MULTICAST)
            ttl = inet->mc_ttl;
        else
            ttl = ip_select_ttl(inet, &rt->u.dst);
        ...
        iph->ttl = ttl;

Figure 21-12. (a) Before and (b) after removing buffers from the sk_write_queue queue

If there are IP options in the header, ip_options_build is used to take care of them. The last input parameter to ip_options_build is set to zero to tell the API that it is filling in the options of the first fragment. This distinction is needed because the first fragment's IP options are treated differently, as we saw in the section "IP Options" in Chapter 18. The length of the header is also updated to reflect the length of the options.

        if (inet->cork.flags & IPCORK_OPT)
            opt = inet->cork.opt;
        ...
        iph->ihl = 5;
        if (opt) {
            iph->ihl += opt->optlen>>2;
            ip_options_build(skb, opt, inet->cork.addr, rt, 0);
        }

The Don't Fragment flag IP_DF of the IP header is set when the socket's configuration enforces the use of that flag on all packets (i.e., IP_PMTUDISC_DO), and when the route rt has PMTU enabled (i.e., IP_PMTUDISC_WANT) and not locked (see the definition of ip_dont_fragment):[*]

    if (inet->pmtudisc != IP_PMTUDISC_DO)
            skb->local_df = 1;
        ...
        if (inet->pmtudisc == IP_PMTUDISC_DO ||
            (skb->len <= dst_mtu(&rt->u.dst) &&
             ip_dont_fragment(sk, &rt->u.dst)))
            df = htons(IP_DF);
        ...
        iph->frag_off = df;

The value just assigned to the df variable, reflecting the packet's Don't Fragment status, determines in turn how the IP packet ID is set. The section "Selecting the IP Header's ID Field" in Chapter 23 goes into more detail on how that ID is computed.

        if (!df) {
            __ip_select_ident(iph, &rt->u.dst, 0);
        } else {
            iph->id = htons(inet->id++);
        }

skb->priority is used by Traffic Control to decide which one of the outgoing queues to enqueue the packet in. See the similar initialization by ip_queue_xmit in the section "Building the IP header."

        iph->version = 4;
        iph->tos = inet->tos;
        iph->tot_len = htons(skb->len);
        iph->protocol = sk->sk_protocol;
        iph->saddr = rt->rt_src;
        iph->daddr = rt->rt_dst;
        ip_send_check(iph);
        skb->priority = sk->sk_priority;
        skb->dst = dst_clone(&rt->u.dst);

Before passing the buffer to dst_output to complete the transmission, the function needs to ask Netfilter permission to do so. Note that Netfilter is queried only once for all the fragments of a packet. In an earlier version of the kernel (2.4), Netfilter was queried for each fragment. This gave Netfilter the chance to filter IP packets with a higher granularity, but it also forced Netfilter to defragment and refragment packets in case there were filters that examined the L4 or higher levels. The overhead was judged too burdensome for the value it offered.

Note that when dst_output is passed a list of sk_buff buffers (as opposed to a single buffer), as shown in Figure 21-12(b), only the first one gets its IP header initialized. We will see in Chapter 22 how such a list is taken care of by ip_fragment.

        err = NF_HOOK(PF_INET, NF_IP_LOCAL_OUT, skb, NULL,
                  skb->dst->dev, dst_output);

Before returning, the function clears the IPCORK_OPT field, which invalidates the contents of the cork structure. This is because later packets to the same destination reuse the cork structure, and the IP layer needs to know when old data should be thrown away.

Putting Together the Transmission Functions

To see how the functions we've been examining, ip_append_data and ip_push_pending_frames, work together, let's focus on a function called by the UDP layer, udp_sendmsg, and see how it calls them.

    int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
            size_t len)
    {
        ... ... ...
        struct udp_opt *up = udp_sk(sk);
        ... ... ...
        int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
        ... ... ...
        err = ip_append_data(sk, ip_generic_getfrag, msg->msg_iov, ulen,
                sizeof(struct udphdr), &ipc, rt,
                corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
        if (err)
            udp_flush_pending_frames(sk);
        else if (!corkreq)
            err = udp_push_pending_frames(sk, up);

The local flag corkreq is initialized based on multiple factors, and will be passed to ip_append_data to signal whether buffering should be used. Among those factors are:

MSG_MORE

This flag can be set or cleared individually on each transmission request.

corkflag (UDP_CORK)

This is applied once to a socket and remains active until explicitly disabled.

These two flags have a comparable purpose. After some discussion over which was the best one, in the end both of them were made available in the kernel.
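
From user space, both knobs are easy to observe. The following self-contained demonstration (Linux only; the port is chosen by the kernel, and error handling is kept minimal) corks a UDP socket, issues two send() calls, and uncorks it, so that the receiver gets a single 10-byte datagram instead of two:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#ifndef UDP_CORK
#define UDP_CORK 1      /* Linux-specific socket option */
#endif

/* Cork a UDP socket, feed it two chunks, then uncork: the kernel
 * buffers the pieces via ip_append_data and only builds and sends
 * one datagram when the cork is removed. */
static bool udp_cork_demo(void)
{
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { 0 };
    socklen_t alen = sizeof(addr);
    struct timeval tv = { 2, 0 };   /* don't hang if delivery fails */
    char buf[64];
    int one = 1, zero = 0;
    ssize_t n;

    if (rx < 0 || tx < 0)
        return false;
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* kernel picks a port */
    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(rx, (struct sockaddr *)&addr, &alen) < 0 ||
        connect(tx, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return false;
    setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    /* send(tx, "hello", 5, MSG_MORE) would buffer on a per-call
     * basis; UDP_CORK does it until explicitly disabled. */
    setsockopt(tx, IPPROTO_UDP, UDP_CORK, &one, sizeof(one));
    send(tx, "hello", 5, 0);
    send(tx, "world", 5, 0);
    setsockopt(tx, IPPROTO_UDP, UDP_CORK, &zero, sizeof(zero));

    n = recv(rx, buf, sizeof(buf), 0);
    close(rx);
    close(tx);
    return n == 10 && memcmp(buf, "helloworld", 10) == 0;
}
```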

udp_sendmsg first calls ip_append_data, and then forces the immediate transmission of the data with udp_push_pending_frames only if corkreq is false. In case ip_append_data failed for any reason, udp_sendmsg flushes the queue with udp_flush_pending_frames, which is a wrapper for the IP function ip_flush_pending_frames.

Figure 21-13 shows the internals of udp_push_pending_frames. Note how the L4 checksum is handled according to the logic we saw in the section "L4 checksum."


Figure 21-13. udp_push_pending_frames function

For an example of how to use ip_append_page, you can take a look at udp_sendpage.

Raw Sockets

It is possible for raw sockets (sockets using raw IP) to include the IP header in the data they pass to the IP layer. This means that the IP layer can be asked to send a piece of data that already includes an initialized IP header. To do this, raw IP uses the IP_HDRINCL (header included) option, which can be set, for instance, with the setsockopt system call (see the ip_setsockopt routine).
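
Setting the option from user space looks as follows. Since opening a raw socket requires CAP_NET_RAW, this sketch treats a permission error as a skip rather than a failure; it only shows the setsockopt call and does not transmit anything:

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <unistd.h>

/* Switch a raw IPv4 socket into "header included" mode. After this,
 * any buffer handed to sendto() must begin with a fully initialized
 * IP header, which the kernel passes down largely untouched. */
static bool hdrincl_demo(void)
{
    int on = 1;
    int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);

    if (fd < 0)   /* unprivileged process: treat as a skip */
        return errno == EPERM || errno == EACCES;
    if (setsockopt(fd, IPPROTO_IP, IP_HDRINCL, &on, sizeof(on)) < 0) {
        close(fd);
        return false;
    }
    close(fd);
    return true;
}
```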

When this option is set, neither ip_push_pending_frames nor ip_queue_xmit is used. Raw IP directly invokes dst_output instead. See the raw_sendmsg and raw_send_hdrinc functions for examples.

Interface to the Neighboring Subsystem

As shown in Figure 18-1 in Chapter 18, transmissions end with a call to ip_finish_output. The latter is a simple wrapper for a Netfilter hook point. Note that ip_finish_output does not follow the naming convention do_something + do_something_finish, but instead the convention do_something + do_something2. ip_finish_output2 is described in the section "Interaction Between Neighboring Protocols and L3 Transmission Functions" in Chapter 27.

    int ip_finish_output(struct sk_buff *skb)
    {
        struct net_device *dev = skb->dst->dev;

        skb->dev = dev;
        skb->protocol = __constant_htons(ETH_P_IP);

        return NF_HOOK(PF_INET, NF_IP_POST_ROUTING, skb, NULL, dev,
                   ip_finish_output2);
    }

When everything is finally in place (including the L2 header), the dev_queue_xmit function is called (via hh->hh_output or dst->neighbour->output) to do the "hard job" of transmission. We already discussed in detail how that function works in Chapter 11.




[*] We saw something similar in the section "ip_forward Function" in Chapter 20.

[*] It may not be clear, by looking at ip_append_data, how MSG_PROBE can be used to test the PMTU. See raw_send_hdrinc in net/ipv4/raw.c for an example.

[*] The pointers on the left side of the buffer are sk_buff's fields, and the ones on the right side are ip_append_data's local variables.

[*] We will see in the section "Buffer allocation" that if the PMTU is not a multiple of eight bytes, the size of all fragments (with the exception of the last one) is shortened to the closest 8-byte boundary.

[*] I'm ignoring the header overhead for the sake of simplicity.

[*] Zero-copy can also be used if the device does not require an L4 checksum. See the description of NETIF_F_NO_CSUM in Chapter 19.

[*] Either value can be changed with ip_setsockopt, but only the unicast value can be set with the /proc interface (see the section "Tuning via /proc Filesystem" in Chapter 23).

[*] The PMTU is one of the metrics that can be assigned to routes. When a metric is locked, it cannot be changed by protocol events. Metrics are introduced in Chapters 30 and 36.

Chapter 22. Internet Protocol Version 4 (IPv4): Handling Fragmentation

Fragmentation and defragmentation are complex tasks because of the variety of inputs that the IP layer of a host can receive both when fragmenting and when defragmenting a packet. We have seen much of the work that goes into fragmentation as part of the functions shown in previous chapters on IPv4. This chapter describes the ip_fragment function, which is defined in net/ipv4/ip_output.c, where all of these efforts reach their final culmination and result in separate packets ready to transmit. This chapter also describes the corresponding ip_defrag function, defined in net/ipv4/ip_fragment.c, where incoming fragments are reassembled into a packet prior to being passed to the L4 layer via ip_local_deliver. Helper functions are described in each section as well.

These two functions can be used by other subsystems besides IPv4. For example, Netfilter uses them when it is forced to defragment (and refragment) an IP packet to be able to access header fields above the L3 layer. This is necessary mostly for forwarded packets and was discussed in the section "The ip_push_pending_frames Function" in Chapter 21.

How does the IP layer recognize that a packet is a fragment of a larger packet? Based on what we saw in Chapter 17, we need both the Offset and MF fields of the IP header to tell. If the packet has not been fragmented, Offset=0 and MF=0. If instead we have a fragment on our hands, the following is true:

  • The first fragment has Offset=0 and MF=1.

  • All the fragments between the first and the last one have both of the fields nonzero.

  • The last fragment has MF=0 and Offset nonzero.
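
The three rules above translate directly into a small classifier. The enum and function names below are invented for illustration; the kernel performs the equivalent test on iph->frag_off, as shown in the snippet that follows:

```c
#include <assert.h>
#include <stdbool.h>

/* Classify an IP packet from its MF bit and its fragment Offset
 * field, per the three rules listed above (names invented). */
enum frag_kind {
    NOT_A_FRAGMENT,
    FIRST_FRAGMENT,
    MIDDLE_FRAGMENT,
    LAST_FRAGMENT
};

static enum frag_kind classify_fragment(unsigned int offset, bool mf)
{
    if (offset == 0)
        return mf ? FIRST_FRAGMENT : NOT_A_FRAGMENT;
    return mf ? MIDDLE_FRAGMENT : LAST_FRAGMENT;
}
```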

We said earlier that ip_local_deliver is one of the places where defragmentation could take place. Here is a snapshot from the function that shows how a fragment is recognized and passed to ip_defrag based on the considerations just listed:

        if (skb->nh.iph->frag_off & htons(IP_MF|IP_OFFSET)) {
            skb = ip_defrag(skb);
            if (!skb)
                return 0;
        }

Similar logic can be found in fragmentation code to correctly tag fragments.

The fragmentation/defragmentation subsystem is initialized by ipfrag_init, which is invoked at boot time by inet_init. The initialization function does not do much; it mainly starts a timer and initializes one variable to a random value. Both of these tasks are needed to handle an optimization added to protect the kernel from a possible Denial of Service (DoS) attack; see the section "Hash Table Reorganization" for details.

IP Fragmentation

As shown in Figure 18-1 in Chapter 18, the dst_output function is called by both locally generated and forwarded packets, so the ip_fragment function in the area below dst_output can run in both situations. Thus, the input to ip_fragment can be:

  • Forwarded packets that are whole

  • Forwarded packets that the originating host or a router along the way has fragmented

  • Buffers created by local functions that, as described in the previous chapter, have started the fragmentation process but have not added the headers that are required for transmission as packets

In particular, ip_fragment must be able to handle both of the following cases:

Big chunks of data that need to be split into smaller parts.

Splitting the big buffer requires the allocation of new buffers and memory copies from the big buffer to the small ones. This, of course, impacts performance.

A list or array of data fragments that do not need to be fragmented further.

If the buffers were allocated such that they have room to allow the addition of lower-layer L3 and L2 headers, ip_fragment can handle them without a memory copy. All the IP layer needs to do is add an IP header to each fragment and handle the checksum.
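
To see why reserved headroom makes a memory copy unnecessary, consider this toy model of a buffer with spare room at the front. The names (`toybuf`, `toy_push`, and so on) are invented for illustration, but `toy_push` mirrors what the kernel's skb_push does: prepending a header is just pointer arithmetic, not a copy of the payload.

```c
#include <assert.h>
#include <string.h>

/* Toy model of the sk_buff headroom trick: a buffer allocated with spare
 * room at the front lets a header be prepended by moving a pointer.
 * Names are illustrative, not the kernel's. */
struct toybuf {
    char  mem[256];
    char *data;      /* start of valid data */
    int   len;       /* bytes of valid data */
};

static void toy_reserve(struct toybuf *b, int headroom)  /* like skb_reserve */
{
    b->data = b->mem + headroom;
    b->len = 0;
}

static void toy_put(struct toybuf *b, const char *payload, int len)
{
    memcpy(b->data + b->len, payload, len);  /* append payload at the tail */
    b->len += len;
}

static char *toy_push(struct toybuf *b, int hdrlen)      /* like skb_push */
{
    b->data -= hdrlen;       /* header space comes from the reserved room */
    b->len += hdrlen;
    return b->data;
}
```

After reserving headroom and writing the payload once, pushing an "IP header" costs only a pointer decrement; the payload bytes never move.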

Previous kernel versions used to handle IP fragmentation entirely at the IP layer. The IP functions used to transmit a packet could receive a payload of any size between 0 and 64 KB, and had to split that payload into multiple IP fragments when the size of the packet exceeded the PMTU. We saw this in the section "Packet Fragmentation/Defragmentation" in Chapter 18.

The approach used by newer kernels is to make the L4 protocols aid in the fragmentation task in advance: instead of passing to the IP layer a single buffer that will have to be fragmented, they can pass a set of buffers appropriate to the PMTU. This way, the IP fragmentation handled at the IP layer consists simply of creating an IP header for each data fragment already formed. This does not mean that the L4 protocols implement IP fragmentation; it simply means that since L4 protocols are aware of IP fragmentation, they try to cooperate and make life easier for the IP layer. The L4 protocols do not touch the IP headers.

Before the introduction of the ip_append_data/ip_append_page functions discussed in Chapter 21, IP fragmentation used to be simpler than IP defragmentation. Now both processes are equally complex.

Fragmentation can currently be done in two ways: the so-called fast (or efficient) way, and the slow (or old-style) way. Both of them are taken care of by ip_fragment. Before seeing how those two approaches differ, let's review the main tasks required to fragment an IP packet:

  1. Split the L3 payload into smaller pieces to fit within the MTU associated with the route being used to send the packet (PMTU). As we will see in a moment, this task may or may not involve some memory copies. If the size of the IP payload is not an exact multiple of the fragment size, the last fragment is smaller than the others. Also, since the fragment offset field of the IP header is measured in units of 8 bytes, this value is aligned to an 8-byte boundary. Every fragment, with the possible exception of the last one, has this size. See Figure 18-10 in Chapter 18.

  2. Initialize each fragment's IP header, taking into account that not all of the options have to be replicated into all of the fragments. ip_options_fragment, introduced in the section "IP Options" in Chapter 19, does this job.

  3. Compute the IP checksum. Each fragment has a different IP header, so the checksum has to be recomputed for each one.

  4. Ask Netfilter, the Linux filtering system, for permission to complete the transmission.

  5. Update all the necessary kernel and SNMP statistics (such as IPSTATS_MIB_FRAGCREATES, IPSTATS_MIB_FRAGOKS, and IPSTATS_MIB_FRAGFAILS).
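
The size computation in the first step is compact enough to model in isolation. The following is a standalone sketch, not kernel code: it only shows how each fragment's payload length is chosen, with every fragment except the last aligned down to an 8-byte boundary.

```c
#include <assert.h>

/* Simplified model of the fragment-size computation in ip_fragment:
 * no fragment payload may exceed mtu (the PMTU minus the IP header),
 * and every fragment but the last must carry a multiple of 8 bytes. */
static int next_frag_len(int left, int mtu)
{
    int len = left;

    if (len > mtu)
        len = mtu;
    if (len < left)        /* not the last fragment: align down to 8 bytes */
        len &= ~7;
    return len;
}
```

With an available payload size of 1,476 bytes per fragment, for instance, full fragments carry 1,472 bytes, and only the final fragment carries the unaligned remainder.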

In kernel versions prior to 2.4, a function named ip_build_xmit_slow created and transmitted IP fragments for locally generated packets in reverse order: last to first. This approach had a couple of advantages:

  • The last fragment is the only one that can tell the receiver the size of the original, unfragmented packet. To know this as soon as possible could help the defragmenter handle its memory better.

  • It makes it more likely that the defragmenter can build up a packet faster. As described in the section "IP Defragmentation," fragments are added into a list (ipq) in increasing order of offset. If each fragment arrives after the fragment that comes after it, fragments can be added speedily at the head of the list.

While this sort of optimization works when the receiver is a Linux box, it might have no effect or even be a drawback if the receiver uses some other operating system that makes different assumptions.[*] Therefore, starting with 2.4, the Linux kernel transmits fragments in forward order.
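
The head-insertion argument can be demonstrated with a toy version of the defragmenter's sorted insert (all names here are illustrative): when fragments arrive last-to-first, each insertion terminates immediately at the head of the list, with no nodes walked.

```c
#include <assert.h>
#include <stddef.h>

/* Toy sorted insert: fragments are kept in increasing offset order,
 * as in the defragmenter's ipq list. steps counts nodes walked past. */
struct frag { int offset; struct frag *next; };

static void sorted_insert(struct frag **head, struct frag *f, int *steps)
{
    struct frag **pp = head;

    *steps = 0;
    while (*pp && (*pp)->offset < f->offset) {
        pp = &(*pp)->next;
        (*steps)++;
    }
    f->next = *pp;          /* splice in before the first larger offset */
    *pp = f;
}
```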

Functions Involved with IP Fragmentation

The previous chapter, which described the functions that transmit data at the IP layer, covered the ip_append_data/ip_append_page set of functions that do a lot of the groundwork for fragmentation. The rest of this section focuses on ip_fragment, which turns the buffers waiting for transmission into actual packets.

Here are a couple of support routines used by the fragmentation code:

ip_dont_fragment

Decides whether the IP packet can be fragmented, based on Path MTU discovery configuration (see the section "Path MTU Discovery" in Chapter 18).

ip_options_fragment

Modifies the IP header of the first fragment so that it can be recycled by the following ones. See the section "IP Options" in Chapter 19.

ip_dont_fragment and ip_options_fragment are defined in include/net/ip.h and net/ipv4/ip_options.c, respectively.

The ip_fragment Function

We already mentioned in the previous section that ip_fragment can take care of fragmentation in two different ways. Let's first see what the common part does. In the next two sections, we will analyze the two cases separately.

    int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff*))

Here are the meanings of the function's input parameters:

skb

Buffer containing the IP packet to fragment. The packet includes an already initialized IP header, which will have to be adapted and replicated into all the fragments. See Figure 21-12(b) in Chapter 21 for an example of what skb may look like.

output

Function to use to transmit the fragments. In Figure 18-1 in Chapter 18, you can see some of the places where ip_fragment is called. You can check them to see what function is used as output (for example, ip_output uses ip_finish_output).

ip_fragment begins by initializing a few key variables that will be used later. It extracts their values from the device and IP header structures that are obtained via the input skb parameter. The egress device dev and the PMTU mtu are extracted from the routing entry used to transmit the packet (rt). You will see in Chapter 36 what other parameters are kept in that data structure.

If the input IP packet cannot be fragmented because the source has set the DF flag, ip_fragment sends an ICMP packet back to the source to notify it of the problem, and then drops the packet. The local_df flag shown in the if condition is set mainly by the Virtual Server code when it does not want the condition just described to generate an ICMP message.

    dev = rt->u.dst.dev;
        iph = skb->nh.iph;

        if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) {
            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                  htonl(dst_pmtu(&rt->u.dst)));
            kfree_skb(skb);
            return -EMSGSIZE;
        }

        hlen = iph->ihl * 4;
        mtu = dst_mtu(&rt->u.dst) - hlen;

Fast fragmentation is used when ip_fragment receives an sk_buff whose data is already fragmented. This is possible, for example, for packets locally generated by an L4 protocol that uses the ip_append_data and ip_push_pending_frames functions. It is also possible for packets generated by L4 protocols that use the ip_queue_xmit function, because they take care of creating fragments themselves. See Chapter 21.

The slow path is used in all the other cases, among which we have:

  • Packets being forwarded

  • Locally generated traffic that has not been fragmented before reaching dst_output

  • All of those cases where fast fragmentation was disabled due to a sanity check on the buffers (see the beginning of ip_fragment)

Even if ip_fragment was given a buffer whose data was already broken into fragment-size buffers as input, it may not be possible to use the fast path due to an error in the organization of the fragments. An error could be caused by a broken feature that performs a faulty buffer manipulation, or by the transformers used by the IPsec protocols.

In both cases (slow and fast), if the transmission of any fragment fails, ip_fragment returns immediately with an error code, and the remaining fragments are not transmitted. When this happens, the destination host receives only a subset of the IP fragments and therefore fails to reassemble them.

Slow Fragmentation

Unlike the fast fragmentation done in collaboration with ip_append_page/ip_append_data, slow fragmentation does not need to keep any state information (such as the list of fragments, etc.). The process simply consists of splitting the IP packet into fragments whose size is given by the MTU of the outgoing interface, or by the MTU associated with the route used if path MTU discovery is enabled.

Before entering the loop, the function needs to initialize a few local variables.

ptr is the offset into the packet about to be fragmented; it is advanced as fragmentation proceeds. left is initialized to the length of the IP payload. In calculating left, the ip_fragment function subtracts hlen (the IP header length) because that component is not part of the IP payload; the function must nevertheless leave room for the header in each fragment buffer, because the header is replicated into every fragment.

The IP header places the fragment offset and the DF and MF flags together in a single 16-bit field. The formula in the following code extracts the offset field from it.
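
For reference, here is a standalone rendition of that extraction. The mask values match the kernel's IP_DF, IP_MF, and IP_OFFSET definitions, and the field is handled in network byte order, as in the kernel code shown in this section.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>   /* htons/ntohs */

/* The IP header packs DF, MF, and the 13-bit fragment offset into a
 * single 16-bit field; the offset is stored in units of 8 bytes. */
#define IP_DF     0x4000
#define IP_MF     0x2000
#define IP_OFFSET 0x1FFF

static int frag_offset_bytes(uint16_t frag_off_net)  /* field as on the wire */
{
    return (ntohs(frag_off_net) & IP_OFFSET) << 3;   /* units of 8 -> bytes */
}

static int has_mf(uint16_t frag_off_net)
{
    return (frag_off_net & htons(IP_MF)) != 0;
}
```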

The local variable not_last_frag, as the name suggests, is true when more data is supposed to follow the current packet: it is set when the packet being fragmented is itself a fragment of a larger packet and is not the last one (that is, its MF flag is set). This is an important bit of information because the last fragment is the one that tells the receiver the size of the original, unfragmented packet, which is valuable for allocating memory efficiently; the function acts on this information later. For example, if a packet is split into two pieces and the second piece is later fragmented again, all of the fragments produced from that second piece must have MF set.

        left = skb->len - hlen;
        ptr = raw + hlen;

        offset = (ntohs(iph->frag_off) & IP_OFFSET) << 3;
        not_last_frag = iph->frag_off & htons(IP_MF);

ip_fragment next starts a loop to create a new buffer for each fragment (skb2). The input parameter skb contains the original IP packet.

        while(left > 0) {
            len = left;

For each fragment, the length is set to the MTU value defined earlier through the PMTU field. The size of the fragment is also aligned to an 8-byte boundary, as imposed by the IP RFC. The only cases where the following condition is not met are when we are transmitting the last fragment or when fragmentation is not needed. But the second case should never occur because if fragmentation were not needed, the function would not execute in the first place.

            if (len > mtu)
                len = mtu;

            if (len < left) {
                len &= ~7;
            }

The size of the buffer allocated to hold a fragment is the sum of:

  • The size of the IP payload

  • The size of the IP header

  • The size of the L2 header

The last of those values is initialized just before the while loop and is retrieved from the routing table cache. The IP layer can learn, from the routing table, the L2 device to be used to transmit the fragments. The ip_fragment function can extract the size of the header associated with the device's protocol from the associated net_device data structure. This value is aligned to a 16-byte boundary by the LL_RESERVED_SPACE[_EXTRA] macros and is stored in the local variable ll_rs (Link Layer Reserved Space). This alignment has nothing to do with the 8-byte alignment just performed on the payload. When the kernel is compiled with support for L2 firewalling (i.e., the CONFIG_BRIDGE_NETFILTER kernel option), ll_rs and mtu are updated accordingly to accommodate a possible 802.1Q header.
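
The rounding itself is easy to show in isolation. This is a simplification: the exact arithmetic of the kernel's LL_RESERVED_SPACE macro differs slightly for lengths that are already aligned, but the intent, rounding the link-layer header length up to a 16-byte boundary, is the same.

```c
#include <assert.h>

/* Simplified stand-in for LL_RESERVED_SPACE: round the link-layer
 * header length up to the next multiple of HH_DATA_MOD (16). */
#define HH_DATA_MOD 16

static int ll_reserved_space(int hard_header_len)
{
    return (hard_header_len + HH_DATA_MOD - 1) & ~(HH_DATA_MOD - 1);
}
```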

            if ((skb2 = alloc_skb(len+hlen+ll_rs,
                                     GFP_ATOMIC)) == NULL) {
                NETDEBUG(printk(KERN_INFO "IP: frag: no memory for new fragment!\n"));
                err = -ENOMEM;
                goto fail;
            }

Now the function needs to copy into the newly allocated buffer skb2 the value of a few fields from the sk_buff structure (the original IP packet) being replicated. Some of them are copied here, and others are taken care of by ip_copy_metadata, which also may copy some fields based on whether specific features (such as Traffic Control and Netfilter) are built into the kernel. The pointers to the L3 (nh.raw) and L4 (h.raw) headers are also initialized.

            ip_copy_metadata(skb2, skb);
            skb_reserve(skb2, ll_rs);
            skb_put(skb2, len + hlen);
            skb2->nh.raw = skb2->data;
            skb2->h.raw = skb2->data + hlen;

The newly allocated buffer is associated with the socket attempting the transmission, if any. (This is the case, for instance, when the transmission was requested with the functions on the left side of Figure 18-1 in Chapter 18.)

            if (skb->sk)
                skb_set_owner_w(skb2, skb->sk);

Now it is time to fill in the new buffer skb2 with some real data. (So far the function has taken care of only the management fields of the sk_buff structure.) This is done in two parts:

  • The IP header is copied with a simple memcpy.

  • Then a piece of payload from the original packet is copied into the fragment.

The latter task cannot use a simple memcpy, because the data may be stored in skb in a variety of ways using a list of fragments or memory page extensions (see Chapter 21). The slow path could be invoked when a packet contains all its data in the memory area pointed to by skb->data (see Figure 21-2 in Chapter 21), or when data has already been fragmented before reaching ip_fragment but one of the sanity checks described earlier rules out the fast path. The logic to handle the various possibilities for data layout is in the helper function skb_copy_bits, which ip_fragment calls.
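
The following toy model (with invented names) shows why such a helper is needed: when data lives partly in a linear area and partly elsewhere, a copy by offset has to walk more than one region. The real skb_copy_bits handles a list of fragments and paged data; two regions are enough to illustrate the logic.

```c
#include <assert.h>
#include <string.h>

/* Toy model of skb_copy_bits: packet data may live partly in the linear
 * head area and partly in a paged fragment, so a copy by offset must
 * walk both regions. Names are illustrative, not the kernel's. */
struct toy_skb {
    const char *head;  int head_len;   /* linear area */
    const char *frag;  int frag_len;   /* one paged fragment */
};

static int toy_copy_bits(const struct toy_skb *skb, int offset,
                         char *to, int len)
{
    if (offset + len > skb->head_len + skb->frag_len)
        return -1;                               /* out of range */
    if (offset < skb->head_len) {                /* part from the linear area */
        int n = skb->head_len - offset;
        if (n > len)
            n = len;
        memcpy(to, skb->head + offset, n);
        to += n; offset += n; len -= n;
    }
    if (len > 0)                                 /* the rest from the fragment */
        memcpy(to, skb->frag + (offset - skb->head_len), len);
    return 0;
}
```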

            memcpy(skb2->nh.raw, skb->data, hlen);

            if (skb_copy_bits(skb, ptr, skb2->h.raw, len))
                BUG( );

            left -= len;
            iph = skb2->nh.iph;
            iph->frag_off = htons((offset >> 3));

The first fragment (where offset is 0) is special from the IP options point of view because it is the only one that includes a full copy of the options from the original IP packet. Not all the options have to be replicated into all of the fragments; only the first fragment will include all of them.

            if (offset == 0)
                ip_options_fragment(skb);

ip_options_fragment, described in Chapter 19, cleans up the content of the ip_opt structure associated with the original IP packet so that fragments following the first one will not include options they do not need. Therefore, ip_options_fragment is called only during the processing of the first fragment (which is the one with offset=0).

The MF flag (for More Fragments) is set if either of the following conditions is met:

  • The packet being fragmented is not a fragment itself, and the fragment created in this loop is not the last one (left > 0).

  • The packet being fragmented is a fragment itself, but is not the last one, and therefore all of its fragments must have MF set (not_last_frag=1).

            if (left > 0 || not_last_frag)
                iph->frag_off |= htons(IP_MF);

The following two statements update two offsets. It is easy to confuse the two. offset is maintained because the packet currently being fragmented may be a fragment of a larger packet; if so, offset represents the offset of the current fragment within the original packet (otherwise, it is simply 0). ptr is an offset within the packet we are fragmenting and changes as the loop progresses. The two variables have the same value in two cases: where the packet we are fragmenting is not a fragment itself, and where this fragment is the very first fragment.

            ptr += len;
            offset += len;
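
These two cursors can be replayed outside the kernel. The sketch below reproduces only the loop's cursor arithmetic (in the real code, ptr is a pointer starting just past the IP header rather than an integer offset):

```c
#include <assert.h>

/* The slow path keeps two cursors that are easy to confuse: ptr walks
 * the payload of the packet currently being fragmented, while offset
 * locates each new fragment within the original, unfragmented packet.
 * They differ exactly when the input packet is itself a fragment. */
static void frag_cursors(int initial_offset, int payload_len, int mtu,
                         int *ptr_out, int *offset_out)
{
    int ptr = 0, offset = initial_offset, left = payload_len;

    while (left > 0) {
        int len = left > mtu ? mtu : left;
        if (len < left)            /* not the last fragment: 8-byte align */
            len &= ~7;
        ptr += len;
        offset += len;
        left -= len;
    }
    *ptr_out = ptr;
    *offset_out = offset;
}
```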

Finally, the slow path needs to update the header length (taking into account the size of the options), compute the checksum with ip_send_check, and transmit the fragment using the output function passed as a parameter. The output function used by IPv4 is ip_finish_output (see Figure 18-1 in Chapter 18).

            iph->tot_len = htons(len + hlen);
            ip_send_check(iph);

            err = output(skb2);

Fast Fragmentation

ip_fragment tries the fast path when it sees that the frag_list pointer of the input skb buffer is not NULL. However, as described earlier in this chapter, it must make sure that the fragments are suitable for the fast path. Here are the sanity checks related to protocol requirements:

  • The size of each fragment should not exceed the PMTU.

  • Only the last fragment can have an L3 payload whose size is not a multiple of eight bytes.

  • Each fragment must have enough space at the head to allow the addition of an L2 header later.

And there are some other buffer management checks as well:

  • The fragment cannot be shared, because that would forbid ip_fragment from editing it to add the IP header. It is acceptable for ip_fragment to receive a shared buffer when using the slow path because the buffer is going to be copied into many other new buffers, but it is not acceptable for the fast path.

        if (skb_shinfo(skb)->frag_list) {
            struct sk_buff *frag;
            int first_len = skb_pagelen(skb);
    
            if (first_len - hlen > mtu ||
                ((first_len - hlen) & 7) ||
                (iph->frag_off & htons(IP_MF|IP_OFFSET)) ||
                skb_cloned(skb))
                goto slow_path;
    
            for (frag = skb_shinfo(skb)->frag_list; frag; frag = frag->next) {
                if (frag->len > mtu ||
                    ((frag->len & 7) && frag->next) ||
                    skb_headroom(frag) < hlen)
                    goto slow_path;
    
                if (skb_shared(frag))
                    goto slow_path;
                ...
            }
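
The protocol-related sanity checks above can be restated as a standalone predicate. The struct and names here are illustrative; the kernel performs these tests directly on the sk_buff fields, as the snippet shows.

```c
#include <assert.h>

/* Standalone restatement of the fast-path sanity checks: every fragment
 * must fit the PMTU, only the last may carry a payload that is not a
 * multiple of 8 bytes, and each needs headroom for the IP header. */
struct chk { int len; int headroom; };

static int fast_path_ok(const struct chk *frags, int n, int mtu, int hlen)
{
    for (int i = 0; i < n; i++) {
        if (frags[i].len > mtu)
            return 0;                           /* fragment too big */
        if ((frags[i].len & 7) && i != n - 1)
            return 0;                           /* only the last may be unaligned */
        if (frags[i].headroom < hlen)
            return 0;                           /* no room for the IP header */
    }
    return 1;
}
```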

The initialization of the IP header of the first fragment is completed outside the loop because it can be optimized slightly. For instance, when this function runs, it knows there are at least two fragments, and therefore it does not need to check frag->next on the first fragment to initialize iph->frag_off: as the first fragment, this fragment must have the IP_MF flag set and the rest of the offset set to 0 (iph->frag_off = IP_MF). The other packets must have the IP_MF bit set in frag_off without disturbing the rest of the value (iph->frag_off |= IP_MF).

Let's suppose the fast path can be used. The rest of the code is pretty simple, and to some extent it is similar to the code seen for the slow path. After the first fragment has been sent (i.e., after the first loop of the for block), the IP header is modified with ip_options_fragment so that it can be recycled by the following fragments. If we exclude that special case, all we need to do to transmit a fragment is:

  • Copy the (modified) header from the first IP fragment into the current fragment.

  • Initialize those fields of the IP header that may differ. Among them are the offset and the IP checksum, which is computed with ip_send_check. Also, if the fragment is not the last one, set the MF flag.

  • Copy from the first fragment to the current fragment the rest of the sk_buff fields, using ip_copy_metadata. These fields are management parameters; they do not have anything to do with the content of the IP fragment.

  • Transmit the fragment with the function output passed as a parameter.

In case of errors, memory for all the subsequent fragments in frag_list is freed (not shown in the following snapshot). Note that the code inside the if (frag) {...} block prepares the fragment that will be transmitted in the following loop iteration, and the call to output transmits the current one.

            skb->data_len = first_len - skb_headlen(skb);
            skb->len = first_len;
            iph->tot_len = htons(first_len);
            iph->frag_off = htons(IP_MF);
            ip_send_check(iph);

            for (;;) {
                if (frag) {
                    frag->ip_summed = CHECKSUM_NONE;
                    frag->h.raw = frag->data;
                    frag->nh.raw = __skb_push(frag, hlen);
                    memcpy(frag->nh.raw, iph, hlen);
                    iph = frag->nh.iph;
                    iph->tot_len = htons(frag->len);

                    ip_copy_metadata(frag, skb);
                    if (offset == 0)
                        ip_options_fragment(frag);
                    offset += skb->len - hlen;
                    iph->frag_off = htons(offset>>3);
                    if (frag->next != NULL)
                        iph->frag_off |= htons(IP_MF);
                    ip_send_check(iph);
                }

                err = output(skb);

                if (err || !frag)
                    break;

                skb = frag;
                frag = skb->next;
                skb->next = NULL;
            }

IP Defragmentation

Defragmentation is needed, obviously, when a packet has reached its final destination and has to be passed to an upper network layer (in Linux, it is handled by the ip_local_deliver function). Routers, by contrast, usually just pass packets through without caring whether they are fragments of a larger packet. But defragmentation can sometimes be required on a router: generally speaking, defragmentation is needed whenever a host has to do some processing on the entire packet. Two such cases on routers are:

  • The IP header contains the Router Alert option, which forces the router to process the packet (see ip_call_ra_chain, called from ip_forward, and Figure 18-1 in Chapter 18).

  • Netfilter has to look at the packet to decide what to do with it. Given the scheme in Figure 18-1 in Chapter 18, the hook points where Netfilter may force defragmentation are NF_IP_PRE_ROUTING and NF_IP_LOCAL_OUT.

But the way defragmentation works does not depend on the circumstances in which it is triggered, so I will describe the implementation from a high-level standpoint.

Organization of the IP Fragments Hash Table

As IP fragments are received, they are organized into a hash table of struct ipq elements. Figure 22-1 shows an example of how the data structure is organized and used.

    #define IPQ_HASHSZ    64
    static struct ipq *ipq_hash[IPQ_HASHSZ];

Each IP packet being defragmented is represented by an ipq instance, which consists of a list of fragments. Figure 22-1 shows an example of an IP packet with ID 1234, for which only two fragments have been received so far. At the bottom of the figure you can see where those two fragments fit into the original IP packet, which is 1,250 bytes in length. The figure shows the roles of some of the most important fields in the data structures involved.

Near the bottom of the figure you can see that an offset for each fragment is stored in a field called cb within each sk_buff. We saw in Chapter 2 that this field is a buffer that can be used by the various network layers to store private information. The data stored in that buffer may change depending on whether the buffer is being received or transmitted.

In the context of IP defragmentation, IP uses the sk_buff->cb field to store an ipfrag_skb_cb structure, which in turn is a simple wrapper for inet_skb_parm, the structure used to store IP options and flags. (The same structure is commonly used by higher layers, too.) The new field added in ipfrag_skb_cb is the offset the fragment lies at inside the original IP packet. That data structure can be accessed with the macro FRAG_CB, defined in net/ipv4/ip_fragment.c. Thus, the IP layer uses FRAG_CB for the purpose of defragmentation and IPCB (defined in include/net/ip.h) for accessing the options for any other purpose; they point to data structures with different names but ultimately to the same locations in memory.

The ipq_hash table is protected by ipfrag_lock, which can be taken either in shared (read-only) or exclusive (read-write) mode. Do not confuse this lock with the one embedded in each ipq element.

Key Issues in Defragmentation

As you read the rest of this section, it will help you to keep in mind the constraints that make defragmentation complex:

  • Fragments must be stored in kernel memory until they are totally processed by the network subsystem, and memory is expensive. Therefore, there must be a way to limit memory use.

  • The most efficient structure for storing large amounts of information (just think of a router passing through millions of packets per second) is a hash table. A hash table can become unbalanced, however, particularly if malicious attackers figure out the hash algorithm and deliberately try to weigh down particular elements of the hash table to slow down processing. In the section "Hash Table Reorganization" we will see how Linux makes the hash algorithm use an additional random component to make the output produced by a given input less predictable.

    Figure 22-1. Structure used to store IP fragments

  • Networking often uses unreliable media, so fragments can be lost. This is particularly true because different fragments within a packet may travel along different paths. Therefore, the IP layer must maintain a timer on each packet and give up at some point, throwing away any fragments received. Checksums must also be employed to maximize the chance that corruption will be detected.

  • If a source host does not receive acknowledgment for some data after a certain amount of time and the transport protocol implements flow control, it retransmits the data. Therefore, multiple overlapping fragments may be received at the destination for a single IP packet. To make this problem more complex, the second IP packet may travel a different path from the first and therefore be fragmented differently, so the boundaries between fragments might not match up. We saw in Chapter 18 that when an IP packet is retransmitted, it is given a new IP ID, which helps reduce the likelihood of this problem. Unfortunately, as we saw in the same chapter, the IP ID can wrap around quickly in a fast network, so the problem of mixing fragments from different IP datagrams still exists. For the criteria used by the IP protocol to associate IP fragments with IP datagrams, please refer to the section "Associating fragments with their IP packets" in Chapter 18.

Together, these requirements lead to the implementation described on the following pages. Fragments are stored in a hash table that is periodically scrambled by introducing a random component into the input passed to the hash function (more details in the section "Hash Table Reorganization"). Each packet is associated with a timer, and is removed if the timer expires. Each fragment is checked for corruption and for overlaps with fragments received earlier.

Functions Involved with Defragmentation

As explained earlier, the main function used to handle defragmentation is ip_defrag. It receives a single fragment as input on each call and tries to add it to the proper packet. The function returns success only when the last fragment has been found and the packet is complete. The next section goes into detail on its implementation. The function also receives a second input parameter, user, that identifies the reason why defragmentation is requested. See the description of user in the section "ipq Structure" in Chapter 23.

Figure 22-1 shows the data structure used to store received IP fragments; it consists of a hash of data structures, one for each complete packet, that in turn point to the fragments for that packet.

Here are some of the support routines used (directly or indirectly) by ip_defrag, all defined in net/ipv4/ip_fragment.c:

ip_evictor

Removes ipq structures of incomplete packets one by one, starting from the oldest, until the memory used by the fragments goes below the sysctl_ipfrag_low_thresh threshold.

For ip_evictor to work properly, a Least Recently Used (LRU) list has to be kept updated. This is achieved simply by adding new ipq structures at the end of a global list (ipq_lru_list), and by moving an ipq structure to the end of that list every time a new fragment is added to it. This means that the element that remains untouched for the longest time is at the head of ipq_lru_list; thus, packets that have no hope of being completed (because the transmitting host went down, for instance) stand out at the front.

ip_find

Finds the packet (fragment list) associated with the fragment being processed. The lookup is based on four fields of the IP header: the ID, the source and destination IP addresses, and the L4 protocol. This makes it pretty certain (but not absolutely certain) that the right packet is chosen (see the section "Example of an unsolvable defragmentation problem: NAT" in Chapter 18). The lookup key actually includes a local parameter too: the user. This parameter is used to identify the reason behind the defragmentation effort (see the section "ipq Structure" in Chapter 23).

ip_frag_queue

Queues a given fragment to the list of fragments (the ipq structure) associated with the same IP packet. See Figure 22-1 and the section "The ip_frag_queue Function."

ip_frag_reasm

Builds the original IP packet from its fragments, once all of them have been received.

Here are a few other support routines used to handle the deletion of an ipq:

ip_frag_destroy

Removes the ipq structure passed to it, and all of its associated IP fragments, and updates the global counter ip_frag_mem (see the section "The ip_defrag Function"). This function is called from the wrapper ipq_put function, instead of being called directly.

ipq_put

Decrements the reference count on the ipq structure passed to it, and removes the structure and fragments with ip_frag_destroy if no one else is holding a reference to it:

    static __inline__ void ipq_put(struct ipq *ipq, int *work)
    {
            if (atomic_dec_and_test(&ipq->refcnt))
                    ip_frag_destroy(ipq, work);
    }

When the input parameter work is not a NULL pointer, ipq_put returns with work initialized to the amount of memory freed by ip_frag_destroy. This is useful, for instance, to an ip_evictor invocation that was called to free a given amount of memory and therefore needs to know how much each call to ipq_put manages to free.

ipq_kill

Marks an ipq structure as eligible to be removed because some of the fragments did not arrive in time. See the section "Garbage Collection" for details.

In the next two sections, we will see ip_defrag and ip_frag_queue in more detail. Let's first see how a new fragment list (ipq instance) is created.

New ipq Instance Initialization

The first task of ip_defrag is to search for the packet to which it should add the fragment it receives as input. To find the packet, the function invokes ip_find. If the fragment happens to be the first of a packet (in terms of the time it arrives, not necessarily its position within the packet), ip_find will fail. In this case, ip_find creates a new ipq instance using the ip_frag_create function. Whether a structure is found or newly created, ip_find returns a pointer to it. ip_defrag uses this pointer to insert the new fragment into the ipq structure of the proper packet. The only case where ip_find fails (returns NULL) is when there is an error trying to create a new ipq element.

While the insertion of a new fragment is handled by ip_defrag, the insertion of a new ipq instance is handled by ip_frag_create via ip_frag_intern. Besides initializing a bunch of parameters in the new structure, this function also starts a garbage collection timer that will clean up the new ipq structure (and all of its fragments) if the associated defragmentation fails to complete within a given amount of time. This timeout, by default, is 30 seconds, but it can be configured via /proc (see the section "Tuning the /proc Filesystem" in Chapter 23). The function that does the garbage collection, ip_expire, also generates an ICMP message to inform the source host about the failed defragmentation attempt.

The ip_defrag Function

The actual ip_defrag function is quite simple, because all the complexity is within the four functions it uses internally: ip_find, ip_frag_queue, ip_frag_reasm, and ip_evictor.

    struct sk_buff *ip_defrag(struct sk_buff *skb, u32 user)

The fragment skb received in input by ip_defrag contains all the information discussed earlier that is needed to identify the ipq instance it belongs to (if one is already created).

The function starts with a check on the memory used up by IP fragments, and may trigger a garbage collection with ip_evictor if a configurable threshold has been reached. See the section "Garbage Collection."

If this is the first fragment of a new IP packet, ip_find creates a new ipq structure; otherwise, it simply returns the one it finds. In case a new one was created, the latter will be added to ipq_hash later with ip_frag_queue.

        if ((qp = ip_find(iph)) != NULL) {
            struct sk_buff *ret = NULL;

Finally, the fragment is enqueued. ip_frag_queue is quite a complex function, and we will analyze it in detail in the next section. The list of fragments, qp, is protected by a lock to make sure there cannot be simultaneous incompatible accesses to the list:

            spin_lock(&qp->lock);
            ip_frag_queue(qp, skb);

If both the first and the last fragments have been received and the total size of the fragments equals the size of the original IP packet, it is time to join the fragments together to obtain the original packet and pass it to the higher layer. ip_frag_reasm stops the timer associated with the qp element, glues together the fragments, updates a few global variables, such as the one that represents the memory used by fragments (ip_frag_mem), and takes care of the L4 hardware checksum (see the section "L4 checksum"). We will not describe this function in detail because it is composed mostly of predictable, low-level instructions.

            if (qp->last_in == (FIRST_IN|LAST_IN) &&
                qp->meat == qp->len)
                ret = ip_frag_reasm(qp, dev);

            spin_unlock(&qp->lock);
            ipq_put(qp, NULL);
            return ret;
        }

        IP_INC_STATS_BH(IPSTATS_MIB_REASMFAILS);
        kfree_skb(skb);
        return NULL;
    }

The ip_frag_queue Function

The task of adding a new fragment to an ipq structure (the list of fragments associated with the same IP packet) is complex because the data structure used to store fragments is not a trivial array where fragments are copied using the offset field. That solution would have one major problem: because the size of the original IP packet is not known until the last fragment is received, this would force the IP layer to allocate a buffer of a size equal to the maximum IP packet size. As you can imagine, this would waste a lot of memory. Also, while easy to implement, such a solution would not perform well and would make it very easy to bring a router to its knees by means of a DoS attack.

The use of a list to handle fragments optimizes the memory used, but makes it a little bit more complicated to handle the fragments. Let's summarize the main tasks accomplished by ip_frag_queue:

  • Figures out where the input fragment falls within the original packet, based on both its offset and its length.

  • Based on the considerations detailed at the start of this chapter, determines whether this is the last fragment of a packet and, if so, extracts the length of the IP packet from it.

  • Inserts the fragment into the list of fragments associated with the same IP packet, handling possible overlaps. (As I explained in an earlier chapter, fragments can overlap if a packet was believed to be lost and was retransmitted by a host, possibly over a different route with a different PMTU.)

  • Updates those fields of the ipq structure that are used by the garbage collection task (i.e., timestamp and memory used).

  • Invalidates the L4 checksum computed in hardware if necessary (e.g., when a fragment needs to be truncated by ip_frag_queue).

This is the prototype:

    static void ip_frag_queue(struct ipq *qp, struct sk_buff *skb)

where qp is the IP packet the fragment belongs to (found by the caller by use of the ip_find function) and skb is the new fragment.

The function starts by extracting data from the IP header and doing a number of general checks to make sure the fragment is valid. First comes a general check to make sure the function has not been called by mistake when the IP packet has already been completely received. The COMPLETE flag, usually set after all the fragments have been received, could also be set in other unusual circumstances—for instance, when ipq_kill marks an ipq element as dead.

        if (qp->last_in & COMPLETE)
            goto err;

The offset is stored in the 13 least-significant bits of the 16-bit Offset field in the IP header. Two of the three most-significant bits are used by two flags: DF and MF.[*] One bit is not used.

Because the offset is expressed in units of eight bytes, the value in the field must be multiplied by 8 before being usable. The header length ihl is expressed in units of four bytes and therefore must be multiplied by 4 before being usable. IP_OFFSET is simply a mask that is used to extract the lower 13 bits from the 16-bit field.

    offset = ntohs(skb->nh.iph->frag_off);
    flags = offset & ~IP_OFFSET;
    offset &= IP_OFFSET;
    offset <<= 3;
    ihl = skb->nh.iph->ihl * 4;

Since the IP fragment carries both its offset and its length, we can easily calculate where the fragment ends within the original IP packet. skb->len - ihl is the size of the IP payload, and since that payload is at offset offset in the original IP packet, their sum gives the offset where this fragment terminates in the original IP packet.

    end = offset + skb->len - ihl;

If the MF flag is not set, it means that the fragment is the last one; we can therefore extract the total length of the original IP packet, store this length in qp->len, and set the flag LAST_IN. If the size of the original packet derived from this last fragment (end) does not match the value we already defined earlier (if any), it means that this fragment or one of the previous ones got corrupted, and therefore this fragment is dropped.

        if ((flags & IP_MF) == 0) {
            if (end < qp->len ||
                ((qp->last_in & LAST_IN) && end != qp->len))
                goto err;
            qp->last_in |= LAST_IN;
            qp->len = end;
        } else {

Every fragment except the last must be a multiple of eight bytes. Thus, if the current fragment is not the last one (MF is not set) and its size is not a multiple of eight bytes, the kernel truncates it to make its size a multiple of eight bytes. (The hope here is that another fragment will arrive with the truncated information and will make the reconstructed packet correct.) Since this operation changes the L4 payload (by truncating the data), the function must also invalidate the checksum in case it had already been computed.[*] The receiving L4 layer will have to recompute it.

            if (end&7) {
                end &= ~7;
                if (skb->ip_summed != CHECKSUM_UNNECESSARY)
                    skb->ip_summed = CHECKSUM_NONE;
            }

If the point where the fragment ends (offset+len) is bigger than the current value of qp->len, the latter is updated. Note that qp->len represents the length of the original defragmented packet only if the last fragment has already been received. For this reason, if a fragment ends past qp->len when the last fragment has been received, it means there is an error somewhere and the fragment is dropped.

            if (end > qp->len) {
                if (qp->last_in & LAST_IN)
                    goto err;
                qp->len = end;
            }
        }

By definition of the IP protocol, the IP header cannot be fragmented. This means that if a packet has been fragmented, there must be a nonempty payload. It follows that the case of a fragment without a payload (that is, the fragment ends where it starts) would not make sense; therefore, if a fragment meets this condition, it is considered corrupted.

        if (end == offset)
            goto err;

Now the function removes the IP header by moving the skb->data offset forward to the IP payload and updating skb->len; this is done by calling pskb_pull. Then the function calls pskb_trim to set the length of the buffer data portion to the length of the IP payload (end-offset). Note that the second operation is actually needed only in the following two cases:

  • The buffer still contains some L2 padding. This should never be the case, because if there was any L2 padding, it would have been removed earlier in the path by ip_rcv.

  • The size of the IP fragment is not a multiple of eight bytes. In this case, the function has shortened the length to a multiple of eight bytes, thus leaving some garbage at the end of the buffer.

            if (pskb_pull(skb, ihl) == NULL)
                goto err;
            if (pskb_trim(skb, end-offset))
                goto err;

The list of fragments contained in the input qp parameter (see Figure 22-1) is kept sorted, with the lowest fragment offset at the head of the list. Therefore, the function now needs to find where in the list to add the new fragment.

        prev = NULL;
        for(next = qp->fragments; next != NULL; next = next->next) {
            if (FRAG_CB(next)->offset >= offset)
                break;
            prev = next;
        }

Handling overlaps

Now it is time to handle potential overlaps with previously received frames. This is done in two steps: first the function handles conflicts with fragments that have a smaller starting offset, and then it handles the others, which have higher starting offsets. The next and prev variables, which point inside the qp list described in the previous section, manage the list of old fragments.

If the new fragment does not have to be placed at the head of the list (prev!=NULL), which means we already received at least one fragment with a smaller offset, we need to handle the insertion by removing the common part (if there is any) from one of the overlapping fragments.

To do this, the function just needs to determine the size of the overlapping portion, and remove a block of that size from the head of the new fragment. Note in the following code that the presence of an overlap is marked by i being a positive number:

        if (prev) {
            int i = (FRAG_CB(prev)->offset + prev->len) - offset;

            if (i > 0) {
                offset += i;
                if (end <= offset)
                    goto err;
                if (!pskb_pull(skb, i))
                    goto err;
                if (skb->ip_summed != CHECKSUM_UNNECESSARY)
                    skb->ip_summed = CHECKSUM_NONE;
            }
        }

When there is indeed an overlap with the previous fragment in the list, the function updates the offset field it extracted earlier from the header by removing the redundant part from the new fragment using pskb_pull, and invalidates the L4 checksum computed in hardware. If moving the offset ahead means that the start becomes higher than the end of the fragment, it means the new fragment is completely contained in the ones already received, so the function can simply return.

Having dealt with the preceding fragments, the function can now take care of a possible overlap with the following fragments (the ones with higher offsets). There can be two such cases:

  • One or more following fragments is completely included in the new one.

  • One following fragment overlaps partially with the new one.

Both cases are illustrated in Figure 22-2, where P indicates a new fragment and F an old one. P1 overlaps only with F2 (completely including it), whereas P2 overlaps with both F3 and F4 (which is completely included).

Figure 22-2. Example of single and multiple overlaps among fragments

At this point, next refers to the fragment whose offset value is the first one greater than the offset of the new fragment. The function goes fragment by fragment until it finds an overlap, simply comparing where the new fragment ends and where the ones in the list end. (Remember that the fragments already in the list are sorted in increasing order of offset.)

        while (next && FRAG_CB(next)->offset < end) {
            int i = end - FRAG_CB(next)->offset;

If the size of the overlapping part is smaller than the size of the new fragment, it means that the function reached the last overlapping fragment of the list and that the only action needed is to remove the overlapping part from the fragment already in the list. The new fragment will be added later. Since the function truncates part of a fragment already in the list, both qp->meat and the offset field of the truncated fragment must be updated and the hardware checksum must be invalidated. Note that when the overlap is with a previous fragment, the function removes data from the new one, but that when the overlap is with a following fragment, the function does the opposite.

            if (i < next->len) {
                if (!pskb_pull(next, i))
                    goto err;
                FRAG_CB(next)->offset += i;
                qp->meat -= i;
                if (next->ip_summed != CHECKSUM_UNNECESSARY)
                    next->ip_summed = CHECKSUM_NONE;
                break;
            } else {

If instead the fragment is completely contained in the new one, the function can remove it from the list (once again updating qp->meat as well).[*]

If the fragment being removed is the head of the list, the head pointer has to be updated:

                struct sk_buff *free_it = next;
                next = next->next;

                if (prev)
                    prev->next = next;
                else
                    qp->fragments = next;

                qp->meat -= free_it->len;
                frag_kfree_skb(free_it, NULL);
            }
        }

Finally, after having resolved all possible overlaps, the function can insert the new fragment into the list and update a few parameters of the qp structure, such as meat, stamp, and last_in. skb->truesize and the memory currently used by fragments (ip_frag_mem) also are updated. qp is also moved to the end of the ipq_lru_list list.

        qp->stamp = skb->stamp;
        qp->meat += skb->len;
        atomic_add(skb->truesize, &ip_frag_mem);

        if (offset == 0)
            qp->last_in |= FIRST_IN;

L4 checksum

Ingress IP fragments could already have their L4 checksum computed if the ingress device supports L4 hardware checksumming. When the fragments are reassembled by ip_frag_reasm, it combines the checksums of the individual fragments with csum_add and saves the result in the reassembled buffer. However, when one of the following conditions is met, the hardware checksum on the reassembled buffer is invalidated (i.e., skb->ip_summed is set to CHECKSUM_NONE):
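
A userspace model may help clarify how partial checksums can be combined. The accumulator below is a sketch in the spirit of csum_add (one's-complement addition with the carry folded back in); it is not the kernel's implementation, which is architecture-specific.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of one's-complement accumulation as used for Internet
 * checksums: add two partial sums and fold the wraparound carry back
 * into the low bit. Names are illustrative, not the kernel's. */
static uint32_t csum_add_sketch(uint32_t csum, uint32_t addend)
{
    uint32_t res = csum + addend;

    return res + (res < addend);   /* carry re-enters at bit 0 */
}
```

Because one's-complement addition is associative and commutative, the per-fragment sums can be combined in any order and still match a checksum computed over the reassembled buffer, which is what makes this combination step valid.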

  • A fragment (with the exception of the last one) has been truncated in ip_defrag because its size was not a multiple of eight bytes.

  • A fragment overlapped with at least one other previously received fragment. Because the overlapping is taken care of by removing the redundant part, the checksum (which covers the redundant part as well) must be invalidated.

Garbage Collection

The kernel implements two kinds of garbage collection for IP fragments:

  • System memory usage limit

  • Defragmentation timer

As a protection against an abuse of the memory used by the IP defragmentation subsystem, a limit on that memory is imposed and stored in the sysctl_ipfrag_high_thresh variable, whose value can be changed at runtime through the /proc filesystem. The global ip_frag_mem variable represents the memory currently used by fragments. It is updated every time a new fragment is added to or removed from the ipq_hash table structure. When the system limit is reached, ip_evictor is invoked to free some memory.

        if (atomic_read(&ip_frag_mem) > sysctl_ipfrag_high_thresh)
            ip_evictor();

The check on the memory limit is implemented by ip_defrag (see the section "The ip_defrag Function").

When the first fragment of a new IP packet is added to the ipq_hash table (i.e., when a new ipq instance is created), the kernel starts a defragmentation timer. The timer is used to discard all the fragments for the incomplete packet to avoid having incomplete IP packets sit in ipq_hash for too long (see the discussion of sysctl_ipfrag_time in the section "Tuning via /proc Filesystem" in Chapter 23). If a fragment is lost or delayed long enough, the timer expires and its handler ip_expire is called to do the cleanup, which consists of the following operations:

  1. Unlinking the ipq structure from the ipq_hash table and from the lru_list list.

  2. If ipq includes the first fragment of the IP packet, sending an ICMP TIME EXCEEDED message back to the source host. The local host must have received the first fragment to be able to transmit the ICMP message because this message needs to include a portion of the original IP packet in its payload, and only the first fragment includes the original IP header (with all of the options of the unfragmented packet) and all or part of the L4 header (see Figure 21-4 in Chapter 21). The ICMP message is sent only if the device the last fragment was received from is still up and running, because it will most probably be used to transmit the ICMP.

  3. Updating the SNMP counters for the failed defragmentation event.

The first operation is accomplished by ipq_kill (by calling ipq_unlink). Because this function is called in other contexts, too—not just by ip_expire—its attempt to stop the ipq's timer is not useless. It will not stop any timer when invoked by ip_expire, but it may stop one in the other cases. If a timer is running for the packet, the ipq's reference count was incremented when the timer was started. Therefore, to keep the reference count correct, ipq_kill decrements the reference count after deleting the timer.
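
The reference-count rule can be modeled in a few lines of userspace C. Everything below (the structure, the field names, the stub timer) is illustrative rather than kernel code; it only captures the invariant that a successfully deleted pending timer releases the reference the timer held.

```c
#include <assert.h>

/* Toy model of the ipq_kill refcount rule. Starting the timer takes a
 * reference on the ipq, so whoever deletes a still-pending timer must
 * drop that reference; if the timer already fired (as when ip_expire
 * is the caller), no reference is dropped. */
struct fake_ipq {
    int refcnt;
    int timer_pending;
};

/* returns nonzero if the timer was pending and got deleted,
 * mimicking del_timer's return value */
static int fake_del_timer(struct fake_ipq *qp)
{
    int was_pending = qp->timer_pending;

    qp->timer_pending = 0;
    return was_pending;
}

static void fake_ipq_kill(struct fake_ipq *qp)
{
    if (fake_del_timer(qp))
        qp->refcnt--;   /* release the reference the timer held */
    /* ... here the real function unlinks the ipq from the hash
     * table and the LRU list and sets the COMPLETE flag ... */
}
```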

Besides ip_expire, here are two other cases that may lead to a call to ipq_kill:

  • ip_frag_reasm calls it when the last missing fragment is received.

  • ip_evictor (introduced at the beginning of this section) calls it to kill the ipq structures it selects for deletion.

Regardless of the reason why ipq_kill was called, the COMPLETE flag is set and the ipq structure is unlinked from all lists it was on. This means that the COMPLETE flag does not necessarily refer to completely defragmented IP packets.

Hash Table Reorganization

We saw in Figure 22-1 how incoming fragments are organized in memory while waiting to be defragmented. At the top level, all fragments for all packets are accessed through a hash table named ipq_hash. A hash table performs best when the hash function can spread the various elements as uniformly as possible. Allowing a large number of collisions, which would cause IP fragments to be bunched up in a few lists that are hanging off of a few elements of ipq_hash, would degrade performance and even allow DoS attacks. To avoid these collisions, the Linux kernel regularly reorganizes all of the IP fragments in the table using a different hash function. This mechanism can be effective only if the reorganization is done frequently and, every time a new function is selected, the new one cannot be guessed from the previous one.

Reorganization of fragments is kicked off by a timer that is started by ipfrag_init and that expires every 10 minutes by default. (The expiration time can be configured by means of the /proc interface, described in "Tuning via /proc Filesystem" in Chapter 23.)

The function executed when the timer expires, ipfrag_secret_rebuild, is pretty simple. Every time it is executed, it generates a random value with get_random_bytes and stores the value in the global variable ipfrag_hash_rnd, which is used by the ipqhashfn hash function. Then, one by one, each element in the hash table is first unlinked, its hash (i.e., bucket) is recomputed with ipqhashfn (that now uses the new value of ipfrag_hash_rnd), and finally it is re-inserted into the table.
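
A toy model of this periodic rehash, with an invented hash function and table size, might look like the following; the kernel's ipqhashfn and the ipq list manipulation are of course more involved.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 64   /* illustrative table size, not the kernel's */

/* stands in for ipfrag_hash_rnd */
static uint32_t hash_rnd;

/* invented hash: salt the key, scramble, reduce to a bucket */
static unsigned int toy_hashfn(uint32_t key)
{
    return (key ^ hash_rnd) * 2654435761u % NBUCKETS;
}

/* Toy rebuild in the spirit of ipfrag_secret_rebuild: pick a new
 * random salt (the kernel uses get_random_bytes), then recompute the
 * bucket of every element so placement cannot be predicted across
 * rebuild intervals. */
static void toy_rebuild(const uint32_t *keys, int n, int *bucket_of)
{
    int i;

    hash_rnd = (uint32_t)rand();
    for (i = 0; i < n; i++)
        bucket_of[i] = (int)toy_hashfn(keys[i]);
}
```

After each rebuild, lookups must use the same generation of hash_rnd that placed the elements, which is why the kernel unlinks and re-inserts every entry under the same rebuild pass.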

ipfrag_hash_rnd is first initialized in ipfrag_init, without using get_random_bytes, because the latter function depends on a quality of the system known as "entropy," built up over time by checking system events that traditionally happen at unpredictable times. At boot time, there may not be enough entropy yet to rely on get_random_bytes for a random number.

The reorganizations of the ipq structures do not affect the ipq_lru_list list.




[*] For example, the PIX firewall from Cisco Systems has an option that lets the administrator prevent IP fragments from passing through unless they are received in order from first to last.

[*] See the file include/net/ip.h, and Figure 18-2 in Chapter 18.

[*] See comments in ip_rcv and ip_rcv_finish for similar conditions.

[] In theory, the corrupted packet could have been the one with MF=0 that previously set qp->len, instead of the one we are dropping now.

[*] frag_kfree_skb updates ip_frag_mem as well.

Chapter 23. Internet Protocol Version 4 (IPv4): Miscellaneous Topics

This chapter wraps up our discussion of the IPv4 layer in the networking code. It covers general topics such as the management of information in the IPv4 layer by the kernel, statistics, and the user interface through /proc. The chapter also includes a brief discussion of the limitations of the IPv4 protocol, which led to the development of IPv6.

Long-Living IP Peer Information

At the IP layer, there is no concept of a stateful connection. Because IP is a stateless protocol, there are no parameters or connection-related data structures to keep, except for statistics. (These are optional and are not required by the protocol itself.) However, to improve performance, the kernel keeps information about some parameters on a per-destination IP address base. We will see an example in a moment.

Any host that has recently carried on an exchange of data with a Linux box is considered an IP peer. The kernel allocates a data structure for each peer to preserve some long-living information. At the moment, not many parameters are kept in the structure. The most important one is the IP packet ID. We saw in Chapter 18 that each IP packet is identified by a 16-bit field called ID. Instead of having a single shared ID, incremented for each IP packet regardless of the destination, one unique instance is kept for each IP peer. (This solution is an implementation choice; it is not imposed by any standard.) We already had a little discussion on the packet ID in Chapter 18.

Peers are represented by inet_peer structures. These structures, defined in include/net/inetpeer.h and described in the section "inet_peer Structure," are organized in an AVL tree, which is a well-known type of data structure optimized for lookups. I will not go into detail about the AVL data structure; you can find it in any programming book.[*] However, it is worthwhile to underline the trade-offs involved in an AVL tree. Essentially, the tree is kept balanced thanks to the way in which insert and delete operations are defined. Because the tree is balanced, a search will always take O(lg n) time, where n is the number of elements in the tree. Generally speaking, because keeping the tree balanced comes at a cost, this kind of data structure is usually used when there are many lookups relative to insert/delete/change operations, and when the speed of these lookups is particularly important.
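
To illustrate the logarithmic lookup cost (this is not the kernel's tree code), a binary search over a sorted array of peer addresses performs a comparable, logarithmic number of probes; the probe counter makes the cost visible.

```c
#include <assert.h>
#include <stdint.h>

/* Illustration of O(lg n) lookup cost using binary search over a
 * sorted array of addresses; an AVL lookup has the same logarithmic
 * behavior. Returns the index of key, or -1 if absent, and reports
 * how many comparisons were needed through *probes. */
static int peer_lookup(const uint32_t *addrs, int n, uint32_t key,
                       int *probes)
{
    int lo = 0, hi = n - 1;

    *probes = 0;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;

        (*probes)++;
        if (addrs[mid] == key)
            return mid;
        if (addrs[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}
```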

The whole AVL tree and the associated global variables (such as peer_total) are protected by the peer_pool_lock lock. The lock can be acquired in both shared and exclusive modes. Lookups need only read privilege and therefore will acquire the lock in shared mode, whereas insert/delete operations have to acquire the lock in exclusive mode.

Initialization

The peer subsystem is initialized by inet_initpeers, which is defined in net/ipv4/inetpeer.c and is invoked by ip_init when the IPv4 protocol is initialized at boot time.

That function accomplishes three main tasks:

  • Allocates the cache that will be used to hold inet_peer structures, which will be allocated as peers are recognized.

  • Defines a threshold (inet_peer_threshold) that will be used to limit the amount of memory used by inet_peer structures. Its value is computed based on the amount of RAM in the system. When a new entry is created, the global counter peer_total is incremented; it is of course decremented when an element is removed. If peer_total becomes bigger than the threshold, the least recently used element is removed (see inet_getpeer).

  • Starts the garbage collection timer. We describe this task in the section "Garbage Collection."

Lookups

The key for a search is the destination's IP address. There are two main functions:

lookup

This is a macro local to net/ipv4/inetpeer.c that implements a simple search in an AVL tree.

inet_getpeer

This function can be used from other subsystems, such as TCP and routing, to search a given entry. This function is built on top of lookup.

inet_getpeer is passed the search key (the peer's IP address) and a flag (create) that can be used to ask for the creation of a new entry in case the search failed. When a new entry is created, the initial IP packet ID is initialized to a random value by means of secure_ip_id.

Figure 23-1 shows the internals of inet_getpeer. The function is pretty simple and does not need much explanation. However, there is one point worth clarifying: why there are two lookups to see whether there is already an entry with the same destination address as the one being requested. The second check is not superfluous because a similar entry could have been created and added to the tree between the time the read lock was released and the write lock was acquired.
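
The pattern is worth showing in miniature. In the sketch below, the lock calls are single-threaded stubs standing in for peer_pool_lock's shared and exclusive modes, and the table is a toy array rather than an AVL tree; only the two-lookup structure is the point.

```c
#include <assert.h>
#include <stdint.h>

/* Stubs standing in for acquiring peer_pool_lock in shared (read)
 * and exclusive (write) mode; single-threaded, purely illustrative. */
static void read_lock(void)  { }
static void write_lock(void) { }
static void unlock(void)     { }

static uint32_t table[16];  /* toy stand-in for the AVL tree */
static int nentries;

static int find(uint32_t key)
{
    int i;

    for (i = 0; i < nentries; i++)
        if (table[i] == key)
            return i;
    return -1;
}

/* Sketch of the lookup/create flow of inet_getpeer: search under the
 * read lock; if creation is requested and the search failed, search
 * AGAIN under the write lock, because another CPU may have inserted
 * the same entry between dropping one lock and taking the other. */
static int get_or_create(uint32_t key, int create)
{
    int idx;

    read_lock();
    idx = find(key);
    unlock();
    if (idx >= 0 || !create)
        return idx;

    write_lock();
    idx = find(key);            /* the second lookup: not superfluous */
    if (idx < 0 && nentries < 16) {
        table[nentries] = key;
        idx = nentries++;
    }
    unlock();
    return idx;
}
```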

How the IP Layer Uses inet_peer Structures

Among the few fields of the inet_peer structure, only two are currently used by the IP layer: v4addr, which identifies the peer, and ip_id_count.

The value of ip_id_count can be retrieved via inet_getid, which automatically increments its value at the same time. The latter is never called directly. The section "Selecting the IP Header's ID Field" offers a list of the wrappers that are used by the IP layer depending on the context.

Garbage Collection

Because the number of inet_peer instances that can be created is limited, there is a timer (peer_periodic_timer) that is started at subsystem initialization time (inet_initpeers) and that at regular intervals causes the removal of entries that have not been used for a given amount of time. The timer handler is peer_check_expire.

The amount of inactivity needed to classify an entry as old depends on how loaded the system is. A system is considered loaded when the number of elements (peer_total) is greater than or equal to the threshold (inet_peer_threshold). On a loaded system, entries are removed after an inactivity period of 120 seconds (inet_peer_minttl). On a system that is not loaded, the value lies between 120 seconds and 10 minutes (inet_peer_maxttl) and is inversely proportional to the number of outstanding inet_peer entries (peer_total). To avoid making the timer a CPU hog, the number of elements removable at each timer expiration is set to PEER_MAX_CLEANUP_WORK (30).

When the timer is first started, the timeout is set to expire after inet_peer_minttl, with a little perturbation to avoid synchronization with other timers started at boot time. After that, the timer does not really run at regular intervals. Instead, the expiration time is set to a value between 10 seconds (inet_peer_gc_mintime) and 120 seconds (inet_peer_gc_maxtime), inversely proportional to the number of entries (see peer_check_expire), which means that the more entries there are, the faster they expire.
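
The clamping described above can be sketched as follows. The exact formula here is an illustration (the kernel computes its own interval inside peer_check_expire); only the inverse proportionality and the two bounds are the point.

```c
#include <assert.h>

#define GC_MINTIME 10    /* seconds, as inet_peer_gc_mintime */
#define GC_MAXTIME 120   /* seconds, as inet_peer_gc_maxtime */

/* Illustrative interval computation: the fuller the peer pool, the
 * shorter the interval between garbage collection runs. The linear
 * scaling below is a guess for demonstration, not the kernel's code. */
static int gc_interval(int peer_total, int threshold)
{
    if (peer_total >= threshold)
        return GC_MINTIME;   /* loaded system: collect as fast as allowed */
    return GC_MINTIME +
           (GC_MAXTIME - GC_MINTIME) * (threshold - peer_total) / threshold;
}
```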

When an entry expires, it is inserted into the unused list, whose head and tail are pointed to by the two global variables inet_peer_unused_head and inet_peer_unused_tailp. The unused list is protected by the inet_peer_unused_lock lock. If an expired entry is still referenced (that is, the reference count is greater than 1), it cannot be freed and it is kept in the unused list; otherwise, it is freed now.

Figure 23-1. inet_getpeer function

When an inet_peer structure is to be removed, because it expired or because it is not used anymore (i.e., its reference count dropped to 0), it is inserted into the unused list but is kept in the AVL tree, too. This means that subsequent lookups on the AVL tree can return inet_peer entries currently in the unused list.

The way entries are purged is through the cleanup_once function, which is called by the timer handler peer_check_expire, and by inet_getpeer when the number of entries passes the allowed limit. The input parameter to cleanup_once specifies how long an inet_peer instance must have spent on the unused list before being eligible for deletion. The value 0, as used by inet_getpeer, means that any instance is eligible.

When an entry that is in the unused list is accessed (i.e., selected by a lookup on the AVL tree), it gets removed from that list. For this reason, an entry can join and leave the unused list several times during its life (see inet_getpeer).

Selecting the IP Header's ID Field

The main function for the initialization of the IP packet ID is __ip_select_ident. This function can be called both directly and indirectly via ip_select_ident or ip_select_ident_more. Both of these wrapper functions differentiate between packets that can and cannot be fragmented (based on the DF flag). Two cases are defined:

Packets cannot be fragmented (DF=1)

This case was added to handle a bug found with some Windows systems' IP stacks.[*] The ID is extracted indirectly from the sock data structures (inet_sk(sk)->sk), where it is incremented each time the wrapper accesses it. This ensures that the IP ID changes at every transmission.

Packets can be fragmented (DF=0)

ip_select_ident takes care of the ID.

ip_select_ident_more, which is used by TCP (see ip_queue_xmit), receives one more input parameter (more) that is used in those cases where the device supports TCP offloading.

Let's go back to __ip_select_ident:

    void __ip_select_ident(struct iphdr *iph, struct dst_entry *dst, int more)
    {
        struct rtable *rt = (struct rtable *) dst;
     
        if (rt) {
            if (rt->peer == NULL)
                rt_bind_peer(rt, 1);
            if (rt->peer) {
                iph->id = htons(inet_getid(rt->peer, more));
                return;
            }
        } else
            printk(KERN_DEBUG "rt_bind_peer(0) @%p\n",
                   __builtin_return_address(0));
     
        ip_select_fb_ident(iph);
    }

We saw in the section "Long-Living IP Peer Information" that for each IP peer there is an inet_peer data structure that keeps, among other things, a counter that can be used to set the IP packet ID (iph->id). __ip_select_ident uses this ID when it is available, and falls back to ip_select_fb_ident otherwise.

If the inet_peer structure is not already initialized in the routing cache entry rt, rt_bind_peer first looks for the inet_peer structure associated with the peer, and if it does not exist, the function tries to create it (because the last input parameter to rt_bind_peer is set to 1). Such creation attempts can fail on a loaded system that runs out of memory and thus cannot afford the allocation of a new inet_peer structure. In this case, _ _ip_select_ident generates an ID with ip_select_fb_ident, which represents the last recourse.

The way ip_select_fb_ident (where fb stands for fallback) works is simple: it keeps a static variable, ip_fallback_id, combines it with the destination IP address of the peer, and passes it to the secure_ip_id function we already saw in the section "Lookups." The only drawback of this solution is that because this function can potentially be used for several peers, there is no longer a guarantee that the IDs assigned to consecutive IP packets sent to any given peer within a reasonable amount of time will be different. It is important that different IP packets addressed to the same destination have different IDs because the IP ID is one of the fields used to take care of defragmentation. Thus, if different IP packets with the same ID get fragmented and the fragments get mixed, there is no way for the receiver to distinguish the fragments belonging to the different IP packets (see the section "Associating fragments with their IP packets" in Chapter 18).
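
A toy version of the fallback generator shows why the guarantee is weakened. The mixing function below is invented (the kernel derives the value through secure_ip_id); the essential property it reproduces is that a single shared counter advances for every destination.

```c
#include <assert.h>
#include <stdint.h>

/* stands in for the kernel's static ip_fallback_id */
static uint16_t ip_fallback_id;

/* Illustrative fallback ID: advance the single shared counter and mix
 * in the destination address. The XOR mixing is a placeholder for the
 * secure_ip_id-based derivation the kernel actually performs. */
static uint16_t select_fb_ident(uint32_t daddr)
{
    ip_fallback_id += 1;
    return (uint16_t)(ip_fallback_id ^ (daddr & 0xffff) ^ (daddr >> 16));
}
```

Because every peer shares ip_fallback_id, traffic to other destinations advances the counter too, so nothing prevents two packets sent to the same peer within a short interval from eventually receiving the same ID once the 16-bit space wraps.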

IP Statistics

The Linux kernel keeps several sets of statistics about different events and conditions that can be useful for accounting, debugging, or confirming compatibility with standards. In this chapter, we will only briefly see what statistics are kept by the IP protocol layer (without touching on the SNMP infrastructure) and how they are updated. In previous chapters, especially when describing the various functions, we saw a few cases where macros such as IP_INC_STATS were used to update the value of some counters.

Let's start with the data structure that contains all of the counters associated with the IP protocol. It is called ip_statistics and is defined in net/ipv4/ip_input.c. It is a vector with two pointers, each one pointing to a vector of ipstats_mib [*] structures (defined in include/net/snmp.h), one per CPU. The allocation of such vectors is done in init_ipv4_mibs in net/ipv4/af_inet.c.

    static int __init init_ipv4_mibs(void)
    {
            ...
            ip_statistics[0] = alloc_percpu(struct ipstats_mib);
            ip_statistics[1] = alloc_percpu(struct ipstats_mib);
            ...
    }
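
The per-CPU arrangement can be modeled in a few lines. NCPUS, the field indices, and the explicit cpu argument are all illustrative; the kernel derives the current CPU implicitly and keeps two such vectors per protocol (the two pointers allocated above), roughly separating interrupt from non-interrupt context.

```c
#include <assert.h>

#define NCPUS   4   /* illustrative CPU count */
#define MIB_MAX 8   /* stands in for _ _IPSTATS_MIB_MAX */

/* One counter array per CPU: increments touch only CPU-local memory,
 * so no locking or cache-line bouncing is needed on the fast path. */
static unsigned long stats[NCPUS][MIB_MAX];

static void mib_inc(int cpu, int field)
{
    stats[cpu][field]++;
}

/* Reading a statistic is the slow path: fold all per-CPU copies. */
static unsigned long mib_read(int field)
{
    unsigned long sum = 0;
    int cpu;

    for (cpu = 0; cpu < NCPUS; cpu++)
        sum += stats[cpu][field];
    return sum;
}
```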

The ipstats_mib structure is simply declared as an array of unsigned long fields of size __IPSTATS_MIB_MAX, which happens to be the size of the IPSTATS_MIB_XXX enumeration list in include/linux/snmp.h.

Here is the meaning of the IPSTATS_MIB_XXX values, classified into four groups. For a more detailed description, you can refer to RFC 2011 for IPv4 and RFC 2465 for IPv6. The IPSTATS_MIB_XXX counters that are not used by IPv4 (with the exception of IPSTATS_MIB_INADDRERRORS) are not defined in RFC 2011.

Fields related to received packets

IPSTATS_MIB_INRECEIVES

Number of packets received. This field does not distinguish between complete IP packets and fragments. It also includes both the ones that will be accepted and the ones that will be discarded for any reason (with the exception of those dropped because an interface in promiscuous mode delivered frames to ip_rcv that were not addressed to the receiving interface). It is updated at the beginning of ip_rcv.

IPSTATS_MIB_INHDRERRORS

Number of packets (fragments as well as nonfragmented packets) that were discarded because of corrupted IP headers. This field can be updated both in ip_rcv and in ip_rcv_finish for different reasons.

IPSTATS_MIB_INTOOBIGERRORS

Not used by IPv4. IPv6 uses it to count those ingress IP packets that cannot be forwarded because they would need to be fragmented (which is not an allowed operation for a router in IPv6, unlike IPv4).

IPSTATS_MIB_INNOROUTES

Not used at the moment. It is supposed to count those ingress packets that could not be forwarded because the local host does not have a valid route.

IPSTATS_MIB_INADDRERRORS

Not used at the moment by IPv4. IPv6 uses it to count those packets received with a wrong address type.

IPSTATS_MIB_INUNKNOWNPROTOS

Number of packets received with an unknown L4 protocol (i.e., no handler for the protocol was registered). This field is updated in ip_local_deliver_finish.

IPSTATS_MIB_INTRUNCATEDPKTS

The packet is truncated (i.e., it does not include a full IP header). It is used by IPv6, but not by IPv4.

IPSTATS_MIB_INDISCARDS

Number of packets discarded. This counter does not include the packets dropped because of header errors; it mainly includes memory allocation problems. This field is updated in ip_rcv and ip_rcv_finish.

IPSTATS_MIB_INDELIVERS

Number of packets successfully delivered to L4 protocol handlers. This field is updated in ip_local_deliver_finish.

IPSTATS_MIB_INMCASTPKTS

Number of received multicast packets. It is used by IPv6, but not by IPv4.

Fields related to transmitted packets

IPSTATS_MIB_OUTFORWDATAGRAMS

Number of ingress packets that needed to be forwarded. This counter is actually incremented before the packets are transmitted and when they theoretically could still be discarded for some reason. Its value is updated in ip_forward_finish (and in ipmr_forward_finish for multicast).

IPSTATS_MIB_OUTREQUESTS

Number of packets that the system tried to transmit (successfully or not), not including forwarded packets. This field is updated in ip_output (and in ip_mc_output for multicast).

IPSTATS_MIB_OUTDISCARDS

Number of packets whose transmission failed. This field is updated in several places, including ip_append_data, ip_push_pending_frames, and raw_send_hdrinc.

IPSTATS_MIB_OUTNOROUTES

Number of locally generated packets discarded because there was no route to transmit them. Normally this field is updated after a failure of ip_route_output_flow. ip_queue_xmit is one of the functions that can update it.

IPSTATS_MIB_OUTMCASTPKTS

Number of transmitted multicast packets. Not used by IPv4 at the moment.

Fields related to defragmentation

IPSTATS_MIB_REASMTIMEOUT

Number of packets that failed defragmentation because some of the fragments were not received in time. The value reflects the number of complete packets, not the number of fragments. This field is updated in ip_expire, which is the timer function executed when an IP fragment list is dropped due to a timeout. Note that this counter is not used as defined in the two RFCs mentioned at the beginning of this section.

IPSTATS_MIB_REASMREQDS

Number of fragments received (and therefore the number of attempted reassemblies). This field is updated in ip_defrag.

IPSTATS_MIB_REASMFAILS

Number of packets that failed the defragmentation. This field is updated in several places (__ip_evictor, ip_expire, ip_frag_reasm, and ip_defrag) for different reasons.

IPSTATS_MIB_REASMOKS

Number of packets successfully defragmented. This field is updated in ip_frag_reasm.

Fields related to fragmentation

IPSTATS_MIB_FRAGFAILS

Number of failed fragmentation efforts. This field is updated in ip_fragment (and in ipmr_queue_xmit for multicast).

IPSTATS_MIB_FRAGOKS

Number of fragments transmitted. This field is updated in ip_fragment.

IPSTATS_MIB_FRAGCREATES

Number of fragments created. This field is updated in ip_fragment.

The values of these counters are exported in the /proc/net/snmp file.

Each CPU keeps its own accounting information about the packets it processes. Furthermore, it keeps two counters: one for events in interrupt context and the other for events outside interrupt context. Therefore, the ip_statistics array includes two elements per CPU, one for interrupt context and one for noninterrupt context. Not all of the events can happen in both contexts, but to make things easier and clearer, the vector has simply been defined at double the size; the elements that do not make sense in one of the two contexts are simply never used.

Because some pieces of code can be executed both in interrupt context and outside interrupt context, the kernel provides three different macros to add an event to the IP statistics vector:

    #define IP_INC_STATS     (field)    SNMP_INC_STATS     (ip_statistics, field)
    #define IP_INC_STATS_BH  (field)    SNMP_INC_STATS_BH  (ip_statistics, field)
    #define IP_INC_STATS_USER(field)    SNMP_INC_STATS_USER(ip_statistics, field)

The first can be used in either context, because it checks internally whether it was called in interrupt context and updates the right element accordingly. The second and the third macros are to be used for events that happen in and outside interrupt context, respectively. The macros IP_INC_STATS, IP_INC_STATS_BH, and IP_INC_STATS_USER are defined in include/net/ip.h, and the three associated SNMP_INC_XXX macros are defined in include/net/snmp.h.
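
This bookkeeping scheme can be modeled in a few lines of user-space C. What follows is a hedged sketch under simplifying assumptions (a fixed CPU count, a tiny subset of counters, and a plain flag standing in for the kernel's interrupt-context test), not the kernel's SNMP_INC_STATS code:

```c
/* Index 0 plays the role of interrupt context (the _BH variant),
 * index 1 of user context, mirroring the two elements per CPU that
 * the ip_statistics layout provides. */
#define NR_CPUS_SIM 2
enum { MIB_INRECEIVES, MIB_INDELIVERS, MIB_MAX };

struct mib { unsigned long counters[MIB_MAX]; };

static struct mib stats[2][NR_CPUS_SIM];

int in_irq_sim; /* stand-in for the kernel's in-interrupt check */

void inc_stats_bh(int cpu, int field)   { stats[0][cpu].counters[field]++; }
void inc_stats_user(int cpu, int field) { stats[1][cpu].counters[field]++; }

/* The plain IP_INC_STATS variant picks the right element itself. */
void inc_stats(int cpu, int field)
{
    if (in_irq_sim)
        inc_stats_bh(cpu, field);
    else
        inc_stats_user(cpu, field);
}

/* Reading a counter sums both contexts across all CPUs. */
unsigned long read_stat(int field)
{
    unsigned long sum = 0;
    for (int ctx = 0; ctx < 2; ctx++)
        for (int cpu = 0; cpu < NR_CPUS_SIM; cpu++)
            sum += stats[ctx][cpu].counters[field];
    return sum;
}
```

Reading a counter, as the /proc/net/snmp export does, means summing both context copies across all CPUs; the update path needs no locking because each CPU increments only its own element.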

IP Configuration

The Linux IP protocol can be tuned and configured manually by a system administrator in different ways. This tuning includes both changes to the protocol itself and to device configuration. The four main interfaces are:

ioctl calls made via ifconfig

ifconfig is the older Unix-legacy tool for configuring IP on network devices.

RTNetlink via ip

ip, which is part of the IPROUTE2 package, is the newer tool that Linux offers for configuring IP on network devices.

/proc filesystem

Protocol behavior can be tuned via a collection of files in the directory /proc/sys/net/ipv4.

RARP/BOOTP/DHCP

These three protocols can be used to dynamically assign an IP configuration to a host and its interfaces.

The last set of protocols in the preceding list has an interesting twist. They are normally implemented in user space, but Linux also has a simple kernel-space implementation that is useful when used together with the nfsroot boot option. The latter allows the kernel to mount the root directory (/) via NFS. To do that, it needs an IP configuration at boot time, before the system is able to initialize the IP configuration from user space (which, by the way, could be stored in a remote partition and not even be available to the system when it mounts the root directory). Via kernel boot options, it is possible to give nfsroot a static configuration, or to specify what protocols (yes, more than one can be used concurrently) to use to obtain the configuration. The IP configuration code is in net/ipv4/ipconfig.c, and the one used by nfsroot is in fs/nfs/nfsroot.c. The two files cross-reference variables and functions, but they are actually simple to read. We will not cover them, because network filesystems and user-space clients are outside the scope of this book. Once you know how to read __setup macros (described in Chapter 7), reading the code should become a piece of cake. It is clear and well commented.

The third item in the list, /proc, is covered later in the section "Tuning via /proc Filesystem."

In this section, I will say a bit about the kernel interfaces that support the behavior of the first two items, ifconfig and ip. The purpose here is not to cover the internals of the user-space commands or the associated kernel counterparts that handle configuration requests. It is to show how user space and kernel space communicate, and the kernel functions that are invoked in response to a user-space command.

Main Functions That Manipulate IP Addresses and Configuration

net/ipv4/devinet.c中,您可以找到几个函数,可用于向网络接口添加 IP 地址、从接口中删除地址、修改地址、在给定设备索引的情况下检索设备的 IP 配置或者net_device数据结构等。这里我只介绍一些有用的功能,以帮助您理解后面我们谈论 ip 和 ifconfig用户 空间工具时描述的功能。

In net/ipv4/devinet.c, you can find several functions that can be used to add an IP address to a network interface, delete an address from an interface, modify an address, retrieve the IP configuration of a device given its device index or net_device data structure, etc. Here I introduce only a few of the functions that will be useful, to help you to understand the functions described later when we talk about the ip and ifconfig user-space tools.

Before reading these descriptions of functions, it would be worthwhile reviewing the key data structures used by the IP layer, introduced in Chapter 19 and described in detail later in this chapter. For instance, a single IP address is represented by an in_ifaddr structure and the complete IPv4 configuration of a device by an in_device structure.

inetdev_init and inetdev_destroy

inetdev_init is invoked when the first IP configuration is applied to a device. It allocates the in_device structure and links it to the associated net_device instance. It also creates a directory in /proc/sys/net/ipv4/conf/ (see the section "Tuning via /proc Filesystem").

The IP configuration can be removed with inetdev_destroy, which simply undoes whatever was done in inetdev_init, plus removes all of the linked in_ifaddr structures. The latter are removed with inet_free_ifa, which also decrements the reference count on the in_device structure with in_dev_put. When the last reference is released, probably with the last call to inet_free_ifa, the in_device instance is freed with in_dev_finish_destroy.

inet_alloc_ifa and inet_free_ifa

Those two functions allocate and free, respectively, an in_ifaddr data structure. A new one is allocated when a user adds a new address to an interface. A deletion can be triggered by the removal of a single address, or by the removal of all of the devices' IP configurations together. Both routines use the read-copy update (RCU) mechanism as a means to enforce mutual exclusion.

inet_insert_ifa and inet_del_ifa

inet_insert_ifa adds a new in_ifaddr structure to the list within in_device. It detects duplicates and marks the address as secondary if it finds out that it falls within another address's subnet. Suppose, for instance that eth0 already had the address 10.0.0.1/24. When a new 10.0.0.2/24 address is added, it will be recognized as secondary with respect to the first. Primary addresses are also used to feed the entropy of the kernel random number generator with net_srandom. More information on primary and secondary addresses can be found in Chapter 30.
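
The secondary-address test that inet_insert_ifa applies can be sketched in a few lines of user-space C. Everything below is illustrative: the struct and function names are made up, and the real code walks the in_device's list of in_ifaddr structures instead of an array.

```c
#include <stdint.h>

/* One configured (address, netmask) pair, host byte order for clarity. */
struct ifaddr_sim { uint32_t addr, mask; };

/* Returns 1 if new_addr falls within the subnet of one of the existing
 * primary addresses, i.e., it would be flagged as secondary. */
int is_secondary(const struct ifaddr_sim *primaries, int n, uint32_t new_addr)
{
    for (int i = 0; i < n; i++)
        if (((primaries[i].addr ^ new_addr) & primaries[i].mask) == 0)
            return 1;
    return 0;
}
```

With primary 10.0.0.1/24 configured, 10.0.0.2 matches the subnet and would be flagged secondary, while 10.0.1.2 would not.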

inet_del_ifa simply removes an in_ifaddr structure from the associated in_device instance, making sure that, if the address is primary, all of the associated secondary addresses are removed too, unless the administrator has explicitly configured the device via its /proc/sys/net/ipv4/conf/ dev_name /promote_secondaries file not to remove secondary addresses. Instead, a secondary address can be promoted to a primary one when the associated primary address is removed. Given the in_device instance, this configuration can be accessed with the IN_DEV_PROMOTE_SECONDARIES macro. The inet_del_ifa function accepts an extra input parameter that can be used to tell whether the in_device structure should be freed when the last in_ifaddr instance has been removed. While it is normal to remove the empty in_device structure, sometimes a caller might not do it, such as when it knows it is going to add a new in_ifaddr soon.

In both cases, addition and deletion, successful completion leads to a Netlink broadcast notification with rtmsg_ifa (see the section "Change Notification: rtmsg_ifa") and a notification to the other kernel subsystems via the inetaddr_chain notification chain (see Chapter 4).

inet_set_ifa

This is a wrapper for inet_insert_ifa that creates an in_device structure if none exists for the associated device, and sets the scope of the address to local (RT_SCOPE_HOST) for addresses like 127.x.x.x. Refer to the section "Scope" in Chapter 30 for more details on scopes.

Many other, smaller functions can be used to make the code more readable. Here are a few of them:

inet_select_addr

This function is used to select an IP address among the ones configured on a given device. The function accepts an optional scope as a parameter, which can be used to narrow down the lookup domain. We will see where this function is useful in Chapter 35.

inet_make_mask and inet_mask_len

Given the number of 1s the netmask is composed of, inet_make_mask creates the associated netmask. For example, an input of 24 would generate the netmask with the decimal representation 255.255.255.0.

inet_mask_len is the converse, returning the number of 1s in a decimal netmask. For instance, 255.255.0.0 would return 16.
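
What the two helpers compute can be shown with a small user-space sketch, using host-order 32-bit values for readability (the kernel versions operate on network-byte-order __be32 values, so this is not the kernel's code):

```c
#include <stdint.h>

/* Build a netmask from a prefix length: 24 -> 0xFFFFFF00 (255.255.255.0). */
uint32_t make_mask(int len)
{
    return len ? (uint32_t)(~0u << (32 - len)) : 0;
}

/* Count the leading 1s of a netmask: 0xFFFF0000 (255.255.0.0) -> 16. */
int mask_len(uint32_t mask)
{
    int len = 0;
    while (mask & 0x80000000u) {
        len++;
        mask <<= 1;
    }
    return len;
}
```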

inet_ifa_match

Given an IP address and a netmask, inet_ifa_match checks whether a given second IP address falls within the same subnet. This function is often used to classify secondary addresses and to check whether a given IP address belongs to one of the locally configured subnets. See, for instance, inet_del_ifa.
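
The core of the check is a one-liner: two addresses are on the same subnet if and only if they agree on every bit covered by the netmask. A user-space sketch (the function name is ours; host-order values are used for readability):

```c
#include <stdint.h>

/* 1 if addr is in the subnet defined by (subnet_addr, mask), else 0. */
int ifa_match_sim(uint32_t subnet_addr, uint32_t mask, uint32_t addr)
{
    return ((subnet_addr ^ addr) & mask) == 0;
}
```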

for_primary_ifa and for_ifa

These two functions are macros that can be used to browse all of the in_ifaddr instances associated with a given in_device structure. for_primary_ifa considers only primary addresses, and for_ifa goes through all of them.

Change Notification: rtmsg_ifa

Netlink provides the RTMGRP_IPV4_IFADDR multicast group to user-space applications interested in changes to the locally configured IP addresses. The kernel uses the rtmsg_ifa function to notify those applications that registered to the group when any change takes place on the local IP addresses. The function can be called when two types of events occur:

RTM_NEWADDR

A new address has been configured on a device.

RTM_DELADDR

An address has been removed from a device.

The generated message is initialized with inet_fill_ifaddr, the same function used to handle dump requests from user space (with commands such as ip addr list). The message includes the address being added or removed, and the device associated with it.
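
On the receiving side, a user-space program joins the group by binding a NETLINK_ROUTE socket whose nl_groups field is set to RTMGRP_IPV4_IFADDR. A minimal, Linux-only sketch (the function name is ours, and error handling is trimmed to the essentials):

```c
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Open a netlink socket subscribed to IPv4 address change events. */
int open_ifaddr_listener(void)
{
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    if (fd < 0)
        return -1;

    struct sockaddr_nl sa;
    memset(&sa, 0, sizeof(sa));
    sa.nl_family = AF_NETLINK;
    sa.nl_groups = RTMGRP_IPV4_IFADDR; /* address add/delete events */

    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        close(fd);
        return -1;
    }
    return fd; /* recv() now yields RTM_NEWADDR/RTM_DELADDR messages */
}
```

After the bind, each recv() on the socket returns netlink messages whose nlmsg_type is RTM_NEWADDR or RTM_DELADDR, carrying an ifaddrmsg plus attributes such as the address itself; this is the same event stream that ip monitor address displays.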

So, who is interested in this kind of notification? Routing protocols are a major example. If you are using Zebra, the routing protocols you have configured would like to remove all of the routes that are directly or indirectly dependent on an address that has gone away. In Chapter 31, you will learn more about the way routing protocols interact with the kernel routing subsystem.

inetaddr_chain Notification Chain

The IP subsystem uses the inetaddr_chain notification chain to notify other kernel subsystems about changes to the IP configuration of the local devices. A kernel subsystem can register and unregister itself with inetaddr_chain by means of the register_inetaddr_notifier and unregister_inetaddr_notifier functions. Here are two examples of users for this notification chain:

Routing

See the section "External Events" in Chapter 32.

Netfilter masquerading

When a local IP address is used by the Netfilter's masquerading feature, and that address disappears, all of the connections that are using that address must be dropped (see net/ipv4/netfilter/ipt_MASQUERADE.c).

The NETDEV_DOWN and NETDEV_UP events are notified, respectively, when an IP address is removed from and when it is added to a local device. Such notifications are generated by the inet_del_ifa and inet_insert_ifa routines introduced in the section "Main Functions That Manipulate IP Addresses and Configuration."

IP Configuration via ip

Traditionally, Unix system administrators configured interfaces and routes manually using ifconfig, route, and other commands. Currently Linux provides an umbrella ip command to handle IP configuration, with a number of subcommands.

In this section we will see how IPROUTE2 handles the main addressing operations, such as adding and removing an address. Once you are familiar with these operations, you can easily understand and read through the code for the others.

Figure 23-2 shows the files and the main functions of the IPROUTE2 package that are involved with IP address configuration activities. The labels on the lines are ip keywords, and the nodes show the function invoked and the file the latter belongs to. For instance, the command ip address add would be handled by ipaddr_modify.

Figure 23-2. IPROUTE2 files and functions for address configuration

Table 23-1 shows the association between the operation specified with a command-line keyword (e.g., add) and the kernel handler run by the kernel. For instance, when the kernel receives a request for an RTM_NEWADDR operation, it knows it is associated with an add command and therefore invokes inet_rtm_newaddr. Some kernel operations are overloaded, and for these, the kernel needs extra flags to figure out exactly what the user-space command is asking for. See Chapter 36 for an example. This association is defined in net/ipv4/devinet.c in the inet_rtnetlink_table structure. For an introduction to RTNetlink, refer to Chapter 3.

Table 23-1. ip route commands and associated kernel operations

CLI keyword        Operation        Kernel handler
add                RTM_NEWADDR      inet_rtm_newaddr
delete             RTM_DELADDR      inet_rtm_deladdr
list, lst, show    RTM_GETADDR      inet_dumpifaddr
flush              RTM_GETADDR      inet_dumpifaddr

The list and flush commands need some explanation. list is simply a request to the kernel to dump information, for instance, about a given device, and flush is a request to clear the entire IP configuration on the device.
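
Conceptually, the keyword-to-operation mapping of Table 23-1 is just a lookup table, similar in spirit to the way IPROUTE2 parses its command line. A hedged sketch with made-up names (the enum values below are illustrative stand-ins for the real RTM_* constants in linux/rtnetlink.h):

```c
#include <string.h>
#include <stddef.h>

enum { OP_RTM_NEWADDR, OP_RTM_DELADDR, OP_RTM_GETADDR, OP_UNKNOWN = -1 };

struct cmd_map {
    const char *keyword; /* what follows "ip address" on the CLI */
    int op;              /* netlink operation sent to the kernel */
};

static const struct cmd_map ip_addr_cmds[] = {
    { "add",    OP_RTM_NEWADDR },
    { "delete", OP_RTM_DELADDR },
    { "list",   OP_RTM_GETADDR },
    { "lst",    OP_RTM_GETADDR },
    { "show",   OP_RTM_GETADDR },
    { "flush",  OP_RTM_GETADDR },
};

int lookup_op(const char *kw)
{
    for (size_t i = 0; i < sizeof(ip_addr_cmds) / sizeof(ip_addr_cmds[0]); i++)
        if (strcmp(ip_addr_cmds[i].keyword, kw) == 0)
            return ip_addr_cmds[i].op;
    return OP_UNKNOWN;
}
```

Note how list, lst, show, and flush all map to the same RTM_GETADDR request: the difference lies on the user-space side, where a flush turns the dumped addresses into a batch of delete requests.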

The two functions inet_rtm_newaddr and inet_rtm_deladdr are wrappers for the generic functions inet_insert_ifa and inet_del_ifa that we introduced in the section "Main Functions That Manipulate IP Addresses and Configuration." All the wrappers do is translate the request that comes from user space into an input understandable by the two more-general functions. They also filter bad requests that are associated with nonexistent devices.

IP Configuration via ifconfig

ifconfig is implemented in the ifconfig.c user-space file (part of the net-tools package). Unlike ip, ifconfig uses ioctl calls to interface to the kernel. However, a set of functions are used by both the ip and ifconfig handlers. In Chapter 3, we had an overview of how ioctl calls are handled by the kernel. Here all we need to know is that the requests related to IPv4 configuration are handled by the inet_ioctl function in net/ipv4/af_inet.c. Based on the ioctl code you can see what helper functions inet_ioctl uses to process the user-space commands (e.g., devinet_ioctl).

As for IPROUTE2, user-space requests from ifconfig are handled on the kernel side by wrappers that end up calling the functions in the section "Main Functions That Manipulate IP Addresses and Configuration."

IP-over-IP

IP-over-IP, also called IP tunneling (or IPIP), consists of transmitting IP packets inside other IP packets. This protocol is useful in some very interesting cases, including in a Virtual Private Network (VPN). Of course, nothing comes for free; you can well imagine the extra weight of the doubling of the protocol: because each IP packet has two IP headers, the overhead becomes huge for small packets. There are subtle complexities in implementation, too. For instance, what is the relationship between the IP options of the two headers?
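
The overhead remark is easy to quantify: tunneling prepends a second, minimal 20-byte IPv4 header to every packet. A small illustrative calculation (integer percentages, no IP options assumed):

```c
/* Fraction of the on-the-wire IPIP packet consumed by the outer header. */
int ipip_overhead_pct(int inner_pkt_len)
{
    const int outer_hdr = 20; /* minimal IPv4 header, no options */
    return 100 * outer_hdr / (inner_pkt_len + outer_hdr);
}
```

For a 40-byte TCP ACK (20-byte IP header plus 20-byte TCP header), the outer header is a third of what goes on the wire; for a 1480-byte packet it is about 1 percent.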

If you consider just the IPv4 and IPv6 protocols, you already have four possible combinations of tunneling. But not all of these combinations are likely to be used.

To make things more complex (I should actually say "flexible"), keep in mind that there is no limit to the number of recursions in tunneling.[*]

The different tunnel interfaces that can be created in Linux are not covered in this book. However, given the background on the IP implementation in this part of the book, you can study the code in net/ipv4/ipip.c and include/net/ipip.h to derive the implementation details.

IPv4: What's Wrong with It?

We saw in the section "IP Protocol: The Big Picture" in Chapter 18 what the main tasks of the IP protocol are. IPv4 was designed almost 25 years ago (in 1981), and given the speed with which the Internet and network services have evolved since then, the protocol is showing its age. Because IPv4 was not originally designed with today's big network topologies and commercial uses in mind, it has shown several limitations over the years. These have been only partially solved, sometimes with special extensions to the protocol (e.g., classless interdomain routing, the DiffServ Code Point (DSCP) replacement of ToS, congestion notification, etc.), and other times by defining specialized external protocols such as IPsec.

Thanks to the experience gained with IPv4, the new IPv6 version of the protocol has been designed to address the known shortcomings of IPv4, taking into consideration such aspects as:

  • Functionality

  • Ease of configuration

  • Performance

  • Transition from IPv4 networks to IPv6 networks

  • Security

Naturally, the committees designing the new protocol have tried to keep IPv4 and IPv6 as compatible as possible, and the transition from one to another as painless as possible. This compatibility and interaction have to be handled not only at the application layer, but also at the kernel layer.

When analyzing IPv4 packet transmission, we saw that fragmentation and options processing were the two most expensive tasks. It should not come as a surprise, therefore, that IPv6 addressed both points:

  • Fragmentation has been limited in IPv6: an IP packet can be fragmented only at the source.

  • The presence of IP options may sometimes inhibit the fast processing path: this is true for both software routers like Linux on a PC and commercial hardware IP implementations. For a commercial implementation, it could mean that IP packets without options can be forwarded in hardware at much higher speed, and the ones with options have to be handled in software. The way options are handled by IPv6 is also different: IPv6 uses the concept of extensions, whose main advantage is that not all of the routers have to process them.

One other big limitation of IPv4 is the 32-bit size of its addresses and the limited hierarchy they come with. Network Address Translation (NAT) is only a short-term solution that partially solves the problem. NAT comes with some limitations, which are listed on the following page.

  • Each protocol has to be treated specially, so some protocols don't always work passing through a NAT router (e.g., H323).

  • NAT 路由器成为单点故障。由于它需要保留通过它的所有连接的状态信息,因此设计一个考虑到冗余或安全性的网络并不容易。

  • The NAT router becomes a single point of failure. Because it needs to keep state information for all the connections passing through it, designing a network with redundancy or security in mind is not easy.

  • 当需要支持那些在设计时未考虑 NAT 的复杂协议(这些协议被认为是“对 NAT 不友好的”[*])时,NAT 的任务非常复杂且计算量大。

  • Its tasks are complex and computationally heavy when there is a need to support those complex protocols that have not been designed with NAT support in mind (these are considered to be "not NAT-friendly"[*]).

IPv4 中有限的地址数量也有助于(由于其有限的层次结构)创建巨大的路由表。一个核心路由器最多可以有数十万条路由。这种趋势很糟糕,原因如下:

The limited number of addresses in IPv4 also contributes (because of its limited hierarchy) to the creation of huge routing tables. A core router can have up to hundreds of thousands of routes. This trend is bad, for a couple of reasons:

  • 路由需要大量内存。

  • The routes require lots of memory.

  • 查找速度较慢。

  • Lookups are slower.

无类域间路由有助于减小路由表的大小,但无法解决IPv4地址空间有限的问题。

Classless interdomain routing helps in reducing the size of the routing tables, but cannot solve the limited address space problem of IPv4.

在 IPv6 中,地址的大小增加到原来的四倍,但这并不意味着地址数量只增加四倍,而是意味着数量增加了 2^96 倍!这可能使系统摆脱 NAT 路由器,成为互联网的成熟公民,从而对新型应用程序产生影响。

In IPv6, the address has been made four times bigger in size, which does not mean four times as many addresses, but rather 2^96 times as many! This potentially brings systems outside the NAT router and makes them full-fledged citizens of the Internet, with implications for new types of applications.

IPv4 的设计并未考虑安全性。因此,已经开发了几种不同粒度的方法:应用程序端到端解决方案,例如安全套接字层(SSL),主机端到端解决方案,例如IPsec等。每种方法都有自己的优点和缺点。SSL 要求编写应用程序以使用该安全层(位于 TCP 之上),而 IPsec(大多数人识别 VPN 的方式)则不需要:IPsec 位于 L3 层,因此对应用程序是透明的。IPv4 和 IPv6 都可以使用 IPsec,但它更适合 IPv6。

IPv4 was not designed with security in mind. Because of this, several approaches of different granularity have been developed: application end-to-end solutions such as Secure Sockets Layer (SSL), host end-to-end solutions such as IPsec, etc. Each has its own pros and cons. SSL requires the applications to be written to use that security layer (which sits on top of TCP), whereas IPsec (which is what most people identify VPNs with) does not: IPsec sits at the L3 layer and therefore is transparent to applications. IPsec can be used by both IPv4 and IPv6, but it fits better with IPv6.

使用 IPv6,邻近系统也发生了变化。它称为邻居发现,相当于 IPv4 的 ARP。QoS 组件也得到了扩展。

With IPv6, the neighboring system has changed as well. It is called neighbor discovery, and represents the counterpart to ARP for IPv4. The QoS component is also expanded.

在 IPv4 网络中,借助 DHCP 等协议,已经可以执行自动主机配置;然而,一些限制使得该解决方案的即插即用 (PnP) 程度低于应有的水平。IPv6 也通过所谓的自动配置功能解决了这个问题。

With IPv4 networks, it is already possible to carry out automatic host configuration, thanks to protocols such as DHCP; however, some constraints make that solution less Plug and Play (PnP) than it should be. This issue has been solved by IPv6 too, with the so-called autoconfiguration feature.

通过 /proc 文件系统进行调整

Tuning via /proc Filesystem

/proc文件系统在第 3 章中介绍;它为用户提供了一个简单的界面来查看和更改内核参数,并且是较新的sysfs目录的模型。它包含大量文件(或者更确切地说,对用户来说就像文件一样的虚拟数据结构),这些文件映射到内核内部的变量和函数,并且也可用于调整内核网络组件的行为。

The /proc filesystem was introduced in Chapter 3; it provides a simple interface for users to view and change kernel parameters and is the model for the newer sysfs directory. It contains a huge number of files (or rather, virtual data structures that look to the user just like files) that map to variables and functions inside the kernel and that can be used to tune the behavior of the networking component of the kernel as well.

用于 IPv4 调优的文件主要位于两个目录:

The files used for IPv4 tuning are located mainly in two directories:

/proc/sys/net/ipv4/
/proc/sys/net/ipv4/

表 23-2显示了该目录中 IPv4 使用的一些文件。与这些文件关联的内核变量在net/ipv4/sysctl_net_ipv4.c中声明,并在启动时静态注册(参见第 3 章)。请注意,该目录包含的文件比表 23-2中的文件多得多。大多数额外文件与 L4 协议相关,尤其是 TCP。

Table 23-2 shows some of the files in this directory that are used by IPv4. The kernel variables associated with those files are declared in net/ipv4/sysctl_net_ipv4.c and are statically registered at boot time (see Chapter 3). Note that the directory contains many more files than the ones in Table 23-2. Most of the extra files are associated with L4 protocols, especially TCP.

/proc/sys/net/ipv4/conf/
/proc/sys/net/ipv4/conf/

该目录包含内核识别的每个网络设备的子目录,以及其他特殊目录(参见第 36 章中的图 36-4)。这些子目录包含设备特定的配置参数,其中有 accept_redirects、send_redirects、accept_source_route 和 forwarding。这些内容将在第 36 章中介绍,但 promote_secondaries 除外,它在“操作 IP 地址和配置的主要函数”部分中描述。

This directory contains a subdirectory for each network device recognized by the kernel, plus other special directories (see Figure 36-4 in Chapter 36). Those subdirectories include configuration parameters that are device specific; among them are accept_redirects, send_redirects, accept_source_route, and forwarding. These will be covered in Chapter 36, with the exception of promote_secondaries, which is described in the section "Main Functions That Manipulate IP Addresses and Configuration."

表23-2。/proc/sys/net/ipv4 中与 IPv4 相关的文件

Table 23-2. IPv4-related files in /proc/sys/net/ipv4

/proc 文件名

/proc filename

关联的内核变量

Associated kernel variable

默认值

Default value

a 这些值由 tcp_init 在启动时根据系统中可用的内存量进行更新。即使它们由 TCP 更新,它们也会被任何使用端口的 L4 协议使用。

a These values are updated by tcp_init at boot time based on the amount of memory available in the system. Even if they are updated by TCP, they are used by any L4 protocol that uses ports.

b 该值由 inet_initpeers 在引导时根据系统中可用的内存量进行更新。

b This value is updated by inet_initpeers at boot time based on the amount of memory available in the system.

ip_forward

ip_forward

ipv4_devconf.forwarding

ipv4_devconf.forwarding

0

0

ip_no_pmtu_disc

ip_no_pmtu_disc

ipv4_config.no_pmtu_disc

ipv4_config.no_pmtu_disc

0

0

ip_autoconfig

ip_autoconfig

ipv4_config.autoconfig

ipv4_config.autoconfig

0

0

ip_default_ttl

ip_default_ttl

sysctl_ip_default_ttl

sysctl_ip_default_ttl

IPDEFTTL (64)

IPDEFTTL (64)

ip_nonlocal_bind

ip_nonlocal_bind

sysctl_ip_nonlocal_bind

sysctl_ip_nonlocal_bind

0

0

ip_local_port_range

ip_local_port_range

sysctl_ip_local_port_range[0]

sysctl_ip_local_port_range[0]

sysctl_ip_local_port_range[1]

sysctl_ip_local_port_range[1]

1

1

65535 a

65535 a

ipfrag_high_thresh

ipfrag_high_thresh

sysctl_ipfrag_high_thresh

sysctl_ipfrag_high_thresh

256K

256K

ipfrag_low_thresh

ipfrag_low_thresh

sysctl_ipfrag_low_thresh

sysctl_ipfrag_low_thresh

192K

192K

ipfrag_time

ipfrag_time

sysctl_ipfrag_time

sysctl_ipfrag_time

IP_FRAG_TIME (30 * HZ)

IP_FRAG_TIME (30 * HZ)

ipfrag_secret_interval

ipfrag_secret_interval

sysctl_ipfrag_secret_interval

sysctl_ipfrag_secret_interval

10 * 60 * HZ

10 * 60 * HZ

ip_dynaddr

ip_dynaddr

sysctl_ip_dynaddr

sysctl_ip_dynaddr

0

0

inet_peer_gc_maxtime

inet_peer_gc_maxtime

inet_peer_gc_maxtime

inet_peer_gc_maxtime

120 * HZ

120 * HZ

inet_peer_gc_mintime

inet_peer_gc_mintime

inet_peer_gc_mintime

inet_peer_gc_mintime

10 * HZ

10 * HZ

inet_peer_maxttl

inet_peer_maxttl

inet_peer_maxttl

inet_peer_maxttl

10 * 60 * HZ

10 * 60 * HZ

inet_peer_minttl

inet_peer_minttl

inet_peer_minttl

inet_peer_minttl

120 * HZ

120 * HZ

inet_peer_threshold

inet_peer_threshold

inet_peer_threshold

inet_peer_threshold

65536 + 128b

65536 + 128b

表 23-2 中的前三个元素是 ipv4_devconf 和 ipv4_config 两种类型的数据结构的成员,这两个结构分别位于 include/linux/inetdevice.h 和 include/net/ip.h 中,本章稍后将对其进行描述。这些结构的其他元素要么在别处导出,要么根本不导出(我们将在相关章节中介绍它们)。文件和内核变量的含义如下:

The first three elements in Table 23-2 are members of two data structures of type ipv4_devconf and ipv4_config, located, respectively, in include/linux/inetdevice.h and include/net/ip.h and described later in this chapter. The other elements of those structures are either exported elsewhere or not exported at all (we will cover them in the associated chapters). The meaning of the files and kernel variables is as follows:

ip_forward
ip_forward

设置为非零值以使设备能够转发流量。请参阅第 36 章中的“启用和禁用转发”部分。

Set to a nonzero value to enable the device to forward traffic. See the section "Enabling and Disabling Forwarding" in Chapter 36.

ip_no_pmtu_disc
ip_no_pmtu_disc

当为 0 时,启用路径 MTU 发现。

When 0, path MTU discovery is enabled.

ip_autoconfig
ip_autoconfig

当主机的 IP 配置是通过 DHCP 等协议完成时,该值设置为 1。请参阅“ IP 配置”部分。

This is set to 1 when the IP configuration of the host was done via a protocol such as DHCP. See the section "IP Configuration."

ip_default_ttl
ip_default_ttl

这是用于单播流量的 IP TTL 字段的默认值。多播流量使用默认值 1,并且没有等效 sysctl变量来设置它。

This is the default value of the IP TTL field used for unicast traffic. Multicast traffic uses the default value of 1 and does not have an equivalent sysctl variable to set it.

ip_nonlocal_bind
ip_nonlocal_bind

当非零时,应用程序可以绑定到并非本机本地的地址。例如,这允许即使在关联接口已关闭的情况下,也能将套接字绑定到某个地址。

When nonzero, it is possible for an application to bind to an address that is not local to the host. This allows, for instance, binding a socket to an address even if the associated interface is down.

ip_local_port_range
ip_local_port_range

可用于传出连接的端口范围。

Range of ports that can be used for outgoing connections.

ipfrag_high_thresh
ipfrag_high_thresh

ipfrag_low_thresh
ipfrag_low_thresh

用于限制传入 IP 片段所占内存量的阈值。当片段使用的内存达到 ipfrag_high_thresh 时,旧的条目将被删除,直到使用的内存下降到 ipfrag_low_thresh。请参阅“垃圾收集”部分。

Thresholds used to limit the amount of memory used by incoming IP fragments. When the memory used by fragments reaches ipfrag_high_thresh, old entries are removed until the memory used declines to ipfrag_low_thresh. See the section "Garbage Collection."

ipfrag_time
ipfrag_time

传入 IP 片段在过期之前在内存中保留的最长时间。

Maximum amount of time incoming IP fragments are kept in memory before expiring.

ipfrag_secret_interval
ipfrag_secret_interval

提取哈希表中的传入 IP 片段并使用不同的哈希函数重新插入的时间间隔。请参阅第 22 章中的“哈希表重组”部分。

Interval after which the incoming IP fragments that are in the hash table are extracted and reinserted with a different hash function. See the section "Hash Table Reorganization" in Chapter 22.

ip_dynaddr
ip_dynaddr

此变量用于处理套接字绑定到按需拨号接口所关联地址的情况:在接口启用之前,这些套接字不会收到任何回复。如果设置了 ip_dynaddr,套接字将重试绑定。

This variable is used to handle the case of sockets bound to addresses associated with dial-on-demand interfaces that do not receive any reply until the interface comes up. If ip_dynaddr is set, the sockets will retry binding.

inet_peer_threshold
inet_peer_threshold

可以分配的 inet_peer 结构的最大数量。

Maximum number of inet_peer structures that can be allocated.

inet_peer_gc_maxtime
inet_peer_gc_maxtime

inet_peer_gc_mintime
inet_peer_gc_mintime

定期垃圾收集之间的时间间隔。由于 inet_peer 结构可用的内存量受到限制(由 inet_peer_threshold 限制),因此有一个定期定时器根据这两个变量来使未使用的条目过期。系统负载不重时使用 inet_peer_gc_maxtime,反之则使用 inet_peer_gc_mintime。因此,条目越多,定时器到期的频率就越高。

Amount of time between regular garbage collection passes. Since the amount of memory usable by the inet_peer structures is limited (by inet_peer_threshold), there is a regular timer that expires unused entries based on these two variables. inet_peer_gc_maxtime is used when the system is not heavily loaded, and inet_peer_gc_mintime is used in the opposite case. Thus, the more entries there are, the more frequently the timer expires.

inet_peer_maxttl
inet_peer_maxttl

inet_peer_minttl
inet_peer_minttl

inet_peer 条目的最大和最小 TTL。出于显而易见的原因,其值应该大于 sysctl_ipfrag_time。

Maximum and minimum TTL of inet_peer entries. Its value is supposed to be bigger than sysctl_ipfrag_time, for obvious reasons.

本书这一部分介绍的数据结构

Data Structures Featured in This Part of the Book

第19章的“主要IPv4数据结构”部分简要概述了主要数据结构。本节详细描述了每种数据结构类型。图 23-3显示了定义每个数据结构的文件。

The section "Main IPv4 Data Structures" in Chapter 19 gave a brief overview of the main data structures. This section has a detailed description of each data structure type. Figure 23-3 shows the file that defines each data structure.

iphdr 结构式

iphdr Structure

其字段的含义已在第18 章的“ IP 标头”部分中介绍过。

The meaning of its fields has already been covered in the section "IP Header" in Chapter 18.

ip_options 结构

ip_options Structure

该结构表示需要传输或转发的数据包的选项。选项存储在此结构中,因为它比 IP 标头本身的相应部分更容易读取。

This structure represents the options for a packet that needs to be transmitted or forwarded. The options are stored in this structure because it is easier to read than the corresponding portion of the IP header itself.

内核文件中数据结构的分布

图 23-3。内核文件中数据结构的分布

Figure 23-3. Distribution of data structures in kernel files

让我们逐个字段来看。如果您已经阅读了第 18 章中的“IP 选项”部分,它们应该相当容易理解。读完此描述后,您将能够更轻松地理解解析是如何完成的,以及 IP 层子系统(例如处理传入 IP 数据包的代码)如何使用解析结果。一些位字段被组合成一个 unsigned char;这些声明以 :1 结尾。

Let's go field by field. They should be fairly simple to understand if you have read the section "IP Options" in Chapter 18. After this description, you will be able to understand more easily how the parsing is done and how its results are used by the IP layer subsystems, such as the code that processes incoming IP packets. Some of the bit fields are grouped together into an unsigned char; the declarations of these end with :1.

unsigned char optlen
unsigned char optlen

选项集的长度。正如第 18 章中所解释的,IP 标头的定义将其限制为最大 40 字节。

Length of the set of options. As explained in Chapter 18, this is limited to a maximum of 40 bytes by the definition of the IP header.

unsigned char is_changed:1
unsigned char is_changed:1

如果 IP 标头已被修改(例如 IP 地址或时间戳),则设置。了解这一点很有用,因为如果必须转发数据包,则该字段指示必须重新计算 IP 校验和。

Set if the IP header has been modified (such as an IP address or a timestamp). This is useful to know because if the packet has to be forwarded, this field indicates that the IP checksum has to be recomputed.

_ _u32 faddr
_ _u32 faddr

unsigned char is_strictroute:1
unsigned char is_strictroute:1

unsigned char srr
unsigned char srr

unsigned char srr_is_hit:1
unsigned char srr_is_hit:1

faddr 仅对传输的数据包(即本地生成的数据包)有意义,并且仅对使用源路由的数据包有意义。faddr 的值被设置为为源路由提供的第一个 IP 地址。请参阅第 19 章中的“选项:严格和松散源路由”部分。

is_strictroute 是一个标志,当严格源路由(Strict Source Route)位于选项之中时设置为 true。

srr 包含源路由选项在标头中的偏移量。如果未使用该选项,则该值为零。

srr_is_hit 在数据包是源路由的并且接收接口的 IP 地址是源路由列表中的地址之一时为 true(参见 net/ipv4/ip_options.c 中的 ip_options_rcv_srr)。

faddr is meaningful only for transmitted packets (that is, those generated locally) and only for those using source routing. The value of faddr is set to the first of the IP addresses provided for source routing. See the section "Option: Strict and Loose Source Routing" in Chapter 19.

is_strictroute is a flag set to true when Strict Source Route is among the options.

srr contains the offset of the Source Route option in the header. If the option is not used, the value is zero.

srr_is_hit is true if the packet was source routed and the IP address of the receiving interface is one of the addresses in the source route list (see ip_options_rcv_srr in net/ipv4/ip_options.c).

unsigned char rr
unsigned char rr

当 rr 非零时,记录路由(Record Route)是 IP 选项之一,该字段的值表示该选项在 IP 标头内开始的偏移量。该字段与 rr_needaddr 一起使用。

When rr is nonzero, Record Route is one of the IP options and the value of this field represents the offset inside the IP header where the option starts. This field is used together with rr_needaddr.

unsigned char rr_needaddr:1
unsigned char rr_needaddr:1

当 rr_needaddr 为 true 时,记录路由是 IP 选项之一,并且标头中仍有空间容纳另一条路由;因此,当前节点应将出接口的 IP 地址复制到 IP 标头中 rr 指定的偏移量处。

When rr_needaddr is true, Record Route is one of the IP options and there is still room in the header for another route; therefore, the current node should copy the IP address of the outgoing interface into the IP header at the offset specified by rr.

unsigned char ts
unsigned char ts

当 ts 非零时,时间戳(Timestamp)是 IP 选项之一,该字段表示该选项在 IP 标头内开始的偏移量。该字段与 ts_needaddr 和 ts_needtime 一起使用。

When ts is nonzero, Timestamp is one of the IP options and this field represents the offset inside the IP header where the option starts. This field is used together with ts_needaddr and ts_needtime.

unsigned char is_setbyuser:1
unsigned char is_setbyuser:1

该字段仅对传输的数据包有意义,并在选项是通过 setsockopt 系统调用从用户空间传递时设置。然而,目前它从未被使用。

This field makes sense only for transmitted packets and is set when the options were passed from user space with the system call setsockopt. Currently, however, it is never used.

unsigned char is_data:1
unsigned char is_data:1

unsigned char _data[0]
unsigned char _data[0]

这些字段在两种情况下使用:当本地节点传输本地生成的数据包时,以及当本地节点回复 ICMP 回显请求时。在这些情况下,is_data为 true 并_data指向包含附加到 IP 标头的选项的区域。该[0]定义是用于为指针保留空间的常见约定。

转发数据包时,选项位于关联的 skb 缓冲区中(请参阅 net/ipv4/ip_options.c 文件中的 ip_options_get 函数)。

These fields are used in two situations: when the local node transmits a locally generated packet, and when the local node replies to an ICMP echo request. In these cases, is_data is true and _data points to an area containing the options to append to the IP header. The [0] definition is a common convention used for reserving space for a pointer.

When forwarding a packet, the options are in the associated skb buffer (see the ip_options_get function in the net/ipv4/ip_options.c file).

unsigned char ts_needtime:1
unsigned char ts_needtime:1

当此选项为 true 时,时间戳是 IP 选项之一,并且标头中仍有空间容纳另一个时间戳;因此,当前节点应将传输时间写入 IP 标头中 ts 指定的偏移量处。

When this option is true, Timestamp is one of the IP options and there is still room in the header for another timestamp; therefore, the current node should add the time of transmission into the IP header at the offset specified by ts.

unsigned char ts_needaddr:1
unsigned char ts_needaddr:1

与 ts 和 ts_needtime 一起使用,指示出口设备的 IP 地址也应复制到 IP 标头中。

Used with ts and ts_needtime to indicate that the IP address of the egress device should also be copied into the IP header.

unsigned char router_alert
unsigned char router_alert

当此选项为 true 时,路由器警报是 IP 选项之一。

When this option is true, Router Alert is one of the IP options.

unsigned char _ _pad1,_ _pad2
unsigned char _ _pad1, _ _pad2

由于当位置与 32 位边界对齐时内存访问速度更快,Linux 内核数据结构通常会用名为 _ _pad n 的未使用字段进行填充,使其大小成为 32 位的倍数。这是 _ _pad1 和 _ _pad2 的唯一目的;它们不作他用。

Because memory accesses are faster when the location is aligned to a 32-bit boundary, the Linux kernel data structures are often padded out with unused fields called _ _pad n in order to make their sizes a multiple of 32 bits. This is the only purpose of _ _pad1 and _ _pad2; they are not used otherwise.

在解析选项时,标志 srr、rr 和 ts 还可用于检测出现多次的选项,这是非法的(请参阅第 19 章中的“选项解析”部分)。

The flags srr, rr, and ts also are useful when parsing the options in order to detect the ones that are present more than once, which is illegal (see the section "Option Parsing" in Chapter 19).

ipcm_cookie 结构

ipcm_cookie Structure

该结构结合了传输数据包所需的各种信息。

This structure combines various pieces of information needed to transmit a packet.

    struct ipcm_cookie
    {
        u32                 addr;
        int                 oif;
        struct ip_options   *opt;
    };
    struct ipcm_cookie
    {
        u32                 addr;
        int                 oif;
        struct ip_options   *opt;
    };

目标 IP 地址是addr,出口设备是oif(如果已定义),并且 IP 选项位于ip_options结构中。请注意,这addr是唯一始终设置的字段。oif如果对使用哪个设备没有限制,则为 0。

The destination IP address is addr, the egress device is oif if defined, and the IP options are in an ip_options structure. Note that addr is the only field that is always set. oif is 0 if there are no constraints on which device to use.

ipq结构

ipq Structure

这里是结构体字段的描述ipq。为了简单起见,第22章图22-1并未显示所有字段。

Here is the description of the fields of the ipq structure. For the sake of simplicity, not all fields are shown in Figure 22-1 in Chapter 22.

struct ipq *next
struct ipq *next

当片段被放入 ipq_hash 哈希表时,冲突的元素(具有相同哈希值的元素)通过该字段链接在一起。请注意,该字段并不表示片段在数据包内的顺序;它只是组织哈希表的标准方式。片段在数据包内的顺序由 fragments 字段控制(参见第 22 章中的图 22-1)。

When the fragments are put into the ipq_hash hash table, conflicting elements (elements with the same hash value) are linked together with this field. Note that this field does not indicate the order of fragments within the packet; it is used simply as a standard way to organize the hash table. The order of fragments within the packet is controlled by the fragments field (see Figure 22-1 in Chapter 22).

struct ipq **pprev
struct ipq **pprev

返回具有相同哈希值的 IP 数据包列表头部的指针。

Pointer back to the head of the list of IP packets that have the same hash value.

struct list_head lru_list
struct list_head lru_list

所有 ipq 结构都按照最近最少使用(LRU)的标准在全局列表 ipq_lru_list 中保持排序。该列表在执行垃圾收集时很有用。该字段用于将 ipq 结构链接到此列表。

All of the ipq structures are kept sorted in a global list, ipq_lru_list, based on a least-recently-used criterion. This list is useful when performing garbage collection. This field is used to link the ipq structure to such a list.

u32 user
u32 user

IP 数据包进行碎片重组的原因,它间接说明了是哪个内核子系统要求进行重组。IP_DEFRAG_XXX 允许值的列表位于 include/net/ip.h 中。最常见的是 IP_DEFRAG_LOCAL_DELIVER,它在重组要本地传送的入口数据包时使用。

The reason why an IP packet is to be defragmented, which indirectly says what kernel subsystem asked for the defragmentation. The list of allowed values for IP_DEFRAG_ XXX is in include/net/ip.h. The most common one is IP_DEFRAG_LOCAL_DELIVER, which is used when defragmenting ingress packets that are to be delivered locally.

u32 saddr
u32 saddr

u32 daddr
u32 daddr

u16 id
u16 id

u8 protocol
u8 protocol

这些参数分别表示源IP地址、目的IP地址、IP数据包ID和L4协议标识符。如第18章所述,这四个参数标识了片段所属的原始IP数据包。因此,它们也是哈希函数用来在整个哈希表中最佳分布元素的参数。

These parameters represent the source IP address, destination IP address, IP packet ID, and L4 protocol identifier, respectively. As described in Chapter 18, these four parameters identify the original IP packet a fragment belongs to. For that reason, they are also the parameters used by the hash function to optimally spread elements throughout the hash table.

u8 last_in
u8 last_in

存储三个标志,其可能的值为:

COMPLETE

所有片段均已收到,因此可以将它们连接在一起以获得原始 IP 数据包。该标志还可用于标记那些已被选中删除的 ipq 结构(请参阅 net/ipv4/ip_fragment.c 中的 ipq_kill)。

FIRST_IN

第一个片段(offset=0 的片段)已收到。第一个片段是唯一携带原始 IP 数据包中所有选项的片段。

LAST_IN

最后一个片段(MF=0 的片段)已收到。最后一个片段很重要,因为它告诉我们原始 IP 数据包的大小。

Stores three flags, whose possible values are:

COMPLETE

All of the fragments have been received and can therefore be joined together to obtain the original IP packet. This flag can also be used to mark those ipq structures that have been chosen for deletion (see ipq_kill in net/ipv4/ip_fragment.c).

FIRST_IN

The first of the fragments (the one with offset=0) has been received. The first fragment is the only one carrying all of the options that were in the original IP packet.

LAST_IN

The last of the fragments (the one with MF=0) has been received. The last fragment is important because it is the one that tells us the size of the original IP packet.

struct sk_buff *fragments
struct sk_buff *fragments

到目前为止收到的片段列表。

List of fragments received so far.

int len
int len

具有最大偏移量的片段结束处的偏移量。当收到最后一个片段(MF=0 的片段)时,len 将给出原始 IP 数据包的大小。

Offset where the fragment with the biggest offset ends. When the last fragment is received (the one with MF=0), len will tell the size of the original IP packet.

int meat
int meat

表示到目前为止我们已收到的原始数据包的字节数。当其值与 len 相同时,数据包已被完全接收。

Represents how many bytes of the original packet we have received so far. When its value is the same as len, the packet has been completely received.

spinlock_t lock
spinlock_t lock

保护结构免受竞争条件的影响。例如,可能会发生不同的 IP 片段由不同 CPU 处理的不同 NIC 同时接收的情况。

Protects the structure from race conditions. It could happen, for instance, that different IP fragments are received at the same time by different NICs handled by different CPUs.

atomic_t refcnt
atomic_t refcnt

用于跟踪对此数据包的外部引用的计数器。作为其用途的一个示例,定时器 timer 会递增 refcnt,以确保在定时器仍处于挂起状态时没有人释放该 ipq 结构;否则,定时器可能到期并尝试访问一个已不复存在的数据结构。后果可想而知。

Counter used to keep track of external references to this packet. As an example of its purpose, the timer timer increments refcnt to make sure that no one is going to free the ipq structure while the timer is still pending; otherwise, the timer might expire and try to access a data structure that does not exist anymore. You can imagine the consequences.

struct timer_list timer
struct timer_list timer

第 18 章解释了为什么 IP 碎片不能永远保留在内存中,如果无法进行碎片整理,则应在一段时间后将其删除。该字段是负责处理该问题的计时器。

Chapter 18 explained why IP fragments cannot stay forever in memory and should be removed after some time if defragmentation is not possible. This field is the timer that takes care of that.

int iif
int iif

接收最后一个片段的设备的 ID。当分片列表过期时,该字段用于决定使用哪个设备来传输 FRAGMENTATION REASSEMBLY TIMEOUT ICMP 消息(请参阅 net/ipv4/ip_fragment.c 文件中的 ip_expire)。

ID of the device from which the last fragment was received. When a list of fragments expires, this field is used to decide which device to use to transmit the FRAGMENTATION REASSEMBLY TIMEOUT ICMP message (see ip_expire in the net/ipv4/ip_fragment.c file).

struct timeval stamp
struct timeval stamp

收到最后一个片段的时间(参见 net/ipv4/ip_fragment.c 中的 ip_frag_queue)。

Time when the last fragment was received (see ip_frag_queue in net/ipv4/ip_fragment.c).

ipq_hash 表受 ipfrag_lock 保护,该锁可以以共享(只读)或独占(读写)模式获取。不要将此锁与每个 ipq 元素中嵌入的锁混淆。

The ipq_hash table is protected by ipfrag_lock, which can be taken in either shared (read-only) or exclusive (read-write) mode. Do not confuse this lock with the one embedded in each ipq element.

inet_peer结构

inet_peer Structure

内核为最近与之通信过的每个远程主机保留该结构的一个实例。在“长期 IP 对等信息”部分中,您已经了解了它的用法。inet_peer 结构的所有实例都保存在一棵 AVL 树中,这是一种针对频繁查找而优化的结构。用于操作 inet_peer 实例的函数位于 net/ipv4/inetpeer.c 中。

The kernel keeps an instance of this structure for each remote host it has been talking to in the recent past. In the section "Long-Living IP Peer Information," you saw how it is used. All instances of inet_peer structures are kept in an AVL tree, a structure optimized for frequent lookups. The functions used to manipulate inet_peer instances are in net/ipv4/inetpeer.c.

struct inet_peer *avl_left
struct inet_peer *avl_left

struct inet_peer *avl_right
struct inet_peer *avl_right

指向两个子树的左指针和右指针。

Left and right pointers to the two subtrees.

_ _u16 avl_height
_ _u16 avl_height

AVL 树的高度。

Height of the AVL tree.

struct inet_peer *unused_next
struct inet_peer *unused_next

struct inet_peer **unused_prevp
struct inet_peer **unused_prevp

用于将节点链接到包含过期元素的列表。 unused_prevp用于检查节点是否在该列表中。

一个节点可以被放入该列表中,然后多次从列表中取出,而不会被完全删除。请参阅“垃圾收集”部分。

Used to link the node into a list that contains elements that expired. unused_prevp is used to check whether the node is in that list.

A node can be put into that list and then taken back out of it several times without ever being removed completely. See the section "Garbage Collection."

unsigned long dtime
unsigned long dtime

该元素通过 inet_putpeer 被添加到未使用列表 inet_peer_unused_head 的时间。

Time when this element was added to the unused list inet_peer_unused_head via inet_putpeer.

atomic_t refcnt
atomic_t refcnt

元素的引用计数。该结构的用户包括路由子系统和 TCP 层。

Reference count for the element. Among the users of this structure are the routing subsystem and the TCP layer.

_ _u32 v4daddr
_ _u32 v4daddr

远程对等点的 IP 地址。

IP address of the remote peer.

_ _u16 ip_id_count
_ _u16 ip_id_count

该对等点下一步要使用的 IP 数据包 ID(请参阅 include/net/inetpeer.h 中的 inet_getid)。

IP packet ID to use next for this peer (see inet_getid in include/net/inetpeer.h).

_ _u32 tcp_ts
_ _u32 tcp_ts

unsigned long tcp_ts_stamp
unsigned long tcp_ts_stamp

TCP 使用它来管理时间戳。

Used by TCP to manage timestamps.

ipstats_mib结构

ipstats_mib Structure

SNMP 协议使用一种称为 MIB 的对象来收集有关系统的统计信息。一个名为 ipstats_mib 的数据结构保存 IP 层的统计信息。“IP 统计”部分更详细地介绍了该结构。

The SNMP protocol employs a type of object called an MIB to collect statistics about systems. A data structure called ipstats_mib keeps statistics on the IP layer. The section "IP Statistics" covered this structure in more detail.

in_device结构体

in_device Structure

in_device 结构存储网络设备的所有 IPv4 相关配置,例如用户通过 ifconfig 或 ip 命令所做的更改。该结构通过 net_device->ip_ptr 链接到 net_device 结构,并且可以使用 in_dev_get 和 _ _in_dev_get 检索。这两个函数的区别在于:第一个函数负责所有必要的加锁,而第二个函数假定调用者已经处理了加锁。

The in_device structure stores all of the IPv4-related configuration for a network device, such as changes made by a user with the ifconfig or ip command. This structure is linked to the net_device structure via net_device->ip_ptr and can be retrieved with in_dev_get and _ _in_dev_get. The difference between those two functions is that the first one takes care of all of the necessary locking, and the second one assumes the caller has taken care of it already.

由于 in_dev_get 成功时(即设备配置为支持 IPv4 时)会在内部增加 in_dev 结构上的引用计数,因此调用者在用完该结构后应使用 in_dev_put 减少引用计数。

Since in_dev_get internally increases a reference count on the in_dev structure when it succeeds (i.e., when a device is configured to support IPv4), its caller is supposed to decrement the reference count with in_dev_put when it is done with the structure.

该结构由 inetdev_init 分配并链接到设备,当在设备上配置第一个 IPv4 地址时会调用该函数。其字段含义如下:

The structure is allocated and linked to the device with inetdev_init, which is called when the first IPv4 address is configured on the device. Here are the meanings of its fields:

struct net_device *dev
struct net_device *dev

返回关联net_device 结构的指针。

Pointer back to the associated net_device structure.

atomic_t refcnt
atomic_t refcnt

参考计数。该字段为 0 之前,无法释放该结构。

Reference count. The structure cannot be freed until this field is 0.

int dead
int dead

设置该字段以将设备标记为失效。这对于检测以下情况很有用:条目因引用计数非零而无法销毁,但销毁操作已经启动。触发移除 in_device 结构的两个最常见的事件是:

  • 设备注销(参见第 8 章

  • 从设备中删除最后配置的 IP 地址(参见inet_del_ifanet /ipv4/devinet.c

This field is set to mark the device as dead. This is useful to detect those cases where the entry cannot be destroyed because it has a nonzero reference count, but a destroy action has been initiated. The two most common events that trigger the removal of an in_device structure are:

  • Unregistration of the device (see Chapter 8)

  • Removal of the last configured IP address from the device (see inet_del_ifa in net/ipv4/devinet.c)

struct in_ifaddr *ifa_list
struct in_ifaddr *ifa_list

设备上配置的 IPv4 地址列表。in_ifaddr 实例按范围排序(范围较大的优先),具有相同范围的元素按地址类型排序(主地址优先)。in_ifaddr 数据结构在“in_ifaddr 结构”部分中有进一步描述。

List of IPv4 addresses configured on the device. The in_ifaddr instances are kept sorted by scope (bigger scope first), and elements with the same scope are kept sorted by address type (primary first). The in_ifaddr data structure is further described in the section "in_ifaddr Structure."

struct neigh_parms *arp_parms
struct neigh_parms *arp_parms

该字段的含义在第六部分中有详细描述。

The meaning of this field is described in detail in Part VI.

struct ipv4_devconf cnf
struct ipv4_devconf cnf

请参阅“ ipv4_devconf 结构”部分

See the section "ipv4_devconf Structure"

struct rcu_head rcu_head
struct rcu_head rcu_head

由 RCU 机制用来强制执行互斥。它完成与锁相同的工作。

Used by the RCU mechanism to enforce mutual exclusion. It accomplishes the same job as a lock.

其余字段由多播代码使用。例如,mc_list 存储设备的多播配置,它是 ifa_list 的多播对应项。mr_v1_seen 和 mr_v2_seen 是 IGMP 协议用来跟踪版本 1 和版本 2 IGMP 数据包接收情况的时间戳。

The rest of the fields are used by the multicast code. For instance, mc_list stores the device's multicast configuration and it is the multicast counterpart of ifa_list. mr_v1_seen and mr_v2_seen are timestamps used by the IGMP protocol to keep track of the reception of versions 1 and 2 IGMP packets.

in_ifaddr 结构

in_ifaddr Structure

在接口上配置 IPv4 地址时,内核会创建一个in_ifaddr包含 4 字节地址以及其他几个字段的结构。以下是它们的含义:

When configuring an IPv4 address on an interface, the kernel creates an in_ifaddr structure that includes the 4-byte address along with several other fields. Here are their meanings:

struct in_ifaddr *ifa_next
struct in_ifaddr *ifa_next

指向列表中下一个元素的指针。该列表包含设备上配置的所有地址。

Pointer to the next element in the list. The list contains all of the addresses configured on the device.

struct in_device *ifa_dev
struct in_device *ifa_dev

返回关联in_device 结构的指针。

Pointer back to the associated in_device structure.

u32 ifa_local
u32 ifa_local

u32 ifa_address
u32 ifa_address

这两个字段的值取决于地址是否分配给隧道接口。如果是,ifa_localifa_address分别是隧道的本地地址和远端地址。如果不是,则两者都包含本地接口的地址。

The values of these two fields depend on whether the address is assigned to a tunnel interface. If so, ifa_local and ifa_address are the local and remote addresses of the tunnel, respectively. If not, both contain the address of the local interface.

u32 ifa_mask
u32 ifa_mask

unsigned char ifa_prefixlen
unsigned char ifa_prefixlen

ifa_mask 是与地址关联的网络掩码。ifa_prefixlen 是组成网络掩码的 1 的个数。由于它们是表示同一信息的不同方式,因此通常可以由其中一个计算出另一个。例如,“IP 配置”部分中描述的 ip 和 ifconfig 用户空间配置工具就是这样做的:ip 向内核传递 ifa_prefixlen 并让内核计算 ifa_mask,而 ifconfig 则相反。内核提供了一些函数在网络掩码和前缀长度之间相互转换。

ifa_mask is the netmask associated with the address. ifa_prefixlen is the number of 1s that compose the netmask. Since they are different ways of representing the same information, one of the two is normally computed from the other. This is done, for instance, by the ip and ifconfig user-space configuration tools described in the section "IP Configuration." ip passes the kernel ifa_prefixlen and lets the latter compute ifa_mask, whereas ifconfig does the opposite. The kernel provides some functions to convert a netmask into a prefix length, and vice versa.

u32 ifa_broadcast
u32 ifa_broadcast

广播地址。

Broadcast address.

u32 ifa_anycast
u32 ifa_anycast

任播地址。

Anycast address.

unsigned char ifa_scope
unsigned char ifa_scope

地址范围。默认值为 RT_SCOPE_UNIVERSE(对应于值 0),该字段通常由 ifconfig/ip 设置为该值,尽管也可以选择不同的值。主要的例外是 127.x.x.x 范围内的地址,它们被赋予 RT_SCOPE_HOST 范围。详细信息请参见第 30 章。

Scope of the address. The default is RT_SCOPE_UNIVERSE (which corresponds to the value 0) and the field is usually set to that value by ifconfig/ip, although a different value can be chosen. The main exception is an address in the range 127.x.x.x, which is given the RT_SCOPE_HOST scope. See Chapter 30 for more details.

unsigned char ifa_flags
unsigned char ifa_flags

可能的 IFA_F_XXX 位标志在 include/linux/rtnetlink.h 中列出。以下是 IPv4 使用的标志:

IFA_F_SECONDARY

当新地址添加到已具有同一子网的另一个地址的设备时,它会被标记为辅助地址。

其他标志由 IPv6 使用。

The possible IFA_F_XXX bit flags are listed in include/linux/rtnetlink.h. Here is the one used by IPv4:

IFA_F_SECONDARY

When a new address is added to a device that already has another address with the same subnet, it is tagged as secondary.

The other flags are used by IPv6.

char ifa_label[IFNAMSIZ]
char ifa_label[IFNAMSIZ]

主要用于向后兼容 2.0.x 内核的字符串,这些内核允许使用诸如 eth0:1 之类名称的别名接口。

A string used mostly for backward compatibility with 2.0.x kernels that allowed aliased interfaces with names such as eth0:1.

struct rcu_head rcu_head
struct rcu_head rcu_head

由 RCU 机制用来强制执行互斥。它完成与锁相同的工作。

Used by the RCU mechanism to enforce mutual exclusion. It accomplishes the same job as a lock.

ipv4_devconf结构

ipv4_devconf Structure

ipv4_devconf 数据结构的字段通过 /proc 在 /proc/sys/net/ipv4/conf/ 中导出,用于调整网络设备的行为。每个设备都有一个实例,另外还有一个存储默认值的实例(ipv4_devconf_dflt)。其字段的含义在第 29 章和第 36 章中介绍,promote_secondaries 除外,它在“操作 IP 地址和配置的主要函数”一节中描述。

The ipv4_devconf data structure, whose fields are exported via /proc in /proc/sys/net/ipv4/conf/, is used to tune the behavior of a network device. There is an instance for each device, plus one that stores the default values (ipv4_devconf_dflt). The meanings of its fields are covered in Chapters 29 and 36, with the exception of promote_secondaries, which is described in the section "Main Functions That Manipulate IP Addresses and Configuration."

ipv4_config结构

ipv4_config Structure

ipv4_devconf 结构用于存储每个设备的配置,而 ipv4_config 存储适用于整个主机的配置。

While ipv4_devconf structures are used to store per-device configuration, ipv4_config stores configuration that applies to the host.

下面简单介绍一下它的字段:

Here is a brief description of its fields:

int log_martians
int log_martians

该参数也存在于 ipv4_devconf 结构中,用于决定当发生特定错误时是否向控制台打印警告消息。它的值不是被直接检查,而是通过宏 IN_DEV_LOG_MARTIANS 检查,该宏给予每设备实例更高的优先级。

This parameter is also present in the ipv4_devconf structure. It is used to decide whether to print warning messages to the console when specific errors occur. Its value is not checked directly, but via the macro IN_DEV_LOG_MARTIANS, which gives higher priority to the per-device instance.

int autoconfig
int autoconfig

未使用。

Not used.

int no_pmtu_disc
int no_pmtu_disc

用于初始化变量 inet_sock->pmtudisc,该变量存储套接字的 PMTU 配置。有关路径 MTU 发现的更多详细信息,请参阅第 18 章。

Used to initialize the variable inet_sock->pmtudisc that stores the PMTU configuration for a socket. See Chapter 18 for more details on path MTU discovery.

cork 结构

cork Structure

cork 结构在 include/linux/ip.h [*] 中定义于 inet_sock 的定义内部,用于处理套接字 cork 选项(UDP 的 UDP_CORK,TCP 的 TCP_CORK)。我们在第 21 章中看到了在连续调用 ip_append_data 和 ip_append_page 处理数据分片时,如何使用它的字段来维护一些上下文信息。

The cork structure, defined in include/linux/ip.h [*] inside the definition of inet_sock, is used to handle the socket cork option (UDP_CORK for UDP, TCP_CORK for TCP). We saw in Chapter 21 how its fields are used to maintain some context information across consecutive invocations of ip_append_data and ip_append_page to handle data fragmentation.

下面简单介绍一下它的字段:

Here is a brief description of its fields:

unsigned int flags
unsigned int flags

目前只能设置 IPv4 使用的一个标志:IPCORK_OPT。当这个标志被设置时,意味着 opt 中有选项。

Currently only one flag used by IPv4 can be set: IPCORK_OPT. When this flag is set, it means there are options in opt.

unsigned int fragsize
unsigned int fragsize

生成的数据片段的大小。这包括有效负载和 L3 标头,通常是 PMTU。

Size of the data fragments generated. This includes both payload and L3 header and is normally the PMTU.

struct ip_options *opt
struct ip_options *opt

要使用的 IP 选项。

IP options to use.

struct rtable *rt
struct rtable *rt

将用于传输 IP 数据包的路由表缓存条目。

Routing table cache entry that will be used to transmit the IP packet.

int length
int length

IP 数据包的大小(所有数据片段的总和,不包括 IP 标头)。

Size of the IP packet (sum of all the data fragments, not including IP headers).

u32 addr
u32 addr

目的IP地址。

Destination IP address.

struct flowi fl
struct flowi fl

有关连接两端信息的集合。更多细节参见第 36 章。

Collection of information about the two ends of the connection. More details are in Chapter 36.

skb_frag_t 结构

skb_frag_t Structure

我们在第 21 章中看到了分页缓冲区的样子(例如,参见该章中的图 21-5)。skb_frag_t包括识别内存页上的数据块所需的字段:

We saw in Chapter 21 what a paged buffer looks like (see, for example, Figure 21-5 in that chapter). skb_frag_t includes the fields necessary to identify a data block on a memory page:

struct page *page
struct page *page

指向内存页的指针。在 i386 上,页面大小为 4 KB。要查找任意给定体系结构 xxx 上的页面大小,请在 include/asm-xxx/page.h 中查找 PAGE_SIZE。

Pointer to the memory page. On i386, the page size is 4 KB. To find the size of a page on any given architecture xxx, look for PAGE_SIZE in include/asm-xxx/page.h.

_ _u16 page_offset
_ _u16 page_offset

偏移量,相对于页面开头(片段开始的位置)。

Offset, relative to the beginning of the page, where the fragment starts.

_ _u16 size
_ _u16 size

片段的大小。

Size of the fragment.

本书这一部分介绍的函数和变量

Functions and Variables Featured in This Part of the Book

表23-3总结了本书涉及IPv4协议的章节中介绍或引用的主要函数、变量和数据结构。

Table 23-3 summarizes the main functions, variables, and data structures introduced or referenced in the chapters of this book covering the IPv4 protocol.

表 23-3。IPv4 子系统中的函数、变量和数据结构

Table 23-3. Functions, variables, and data structures in the IPv4 subsystem

/proc 文件名

/proc filename

关联的内核变量

Associated kernel variable

ip_init

ip_init

初始化 IPv4 协议。请参阅第 19 章中的“ IP 选项”部分。

Initializes the IPv4 protocol. See the section "IP Options" in Chapter 19.

ip_rcv

ip_rcv

处理入口 IP 数据包。请参阅第 19 章中的“处理输入 IP 数据包” 部分。

Processes ingress IP packets. See the section "Processing Input IP Packets" in Chapter 19.

ip_forward

ip_forward

ip_forward_finish

ip_forward_finish

转发入口 IP 数据包或片段。请参阅第 20 章中的“转发”部分。

Forward an ingress IP packet or fragment. See the section "Forwarding" in Chapter 20.

ip_local_deliver

ip_local_deliver

ip_local_deliver_finish

ip_local_deliver_finish

将入口 IP 数据包传送到本地主机。请参阅第 20 章中的“本地交付”部分。

Deliver an ingress IP packet to the local host. See the section "Local Delivery" in Chapter 20.

ipfrag_init

ipfrag_init

初始化 IP 碎片/碎片整理子系统。

Initializes the IP Fragmentation/Defragmentation subsystem.

ip_defrag

ip_defrag

ip_find

ip_find

ip_frag_queue

ip_frag_queue

ip_frag_reasm

ip_frag_reasm

ip_frag_destroy

ip_frag_destroy

ip_expire

ip_expire

ip_evictor

ip_evictor

处理 IP 碎片整理。请参阅第 22 章中的“ IP 碎片整理”部分。

Handle IP defragmentation. See the section "IP Defragmentation" in Chapter 22.

ip_fragment

ip_fragment

ip_dont_fragment

ip_dont_fragment

getfrag

getfrag

处理 IP 碎片。请参阅第 22 章中的“ IP 分段”部分。

Handle IP fragmentation. See the section "IP Fragmentation" in Chapter 22.

ip_options_compile

ip_options_compile

ip_options_parse

ip_options_parse

ip_options_build

ip_options_build

ip_forward_options

ip_forward_options

处理 IP 选项。请参阅第 19 章中的“ IP 选项”部分。

Handle IP options. See the section "IP options" in Chapter 19.

ip_queue_xmit,

ip_queue_xmit,

ip_append_data, ip_push_pending_frames

ip_append_data, ip_push_pending_frames

由 L4 协议用来传输 IP 数据包。请参阅第 21 章中的“执行传输的关键函数”部分。

Used by L4 protocols to transmit IP packets. See the section "Key Functions That Perform Transmission" in Chapter 21.

dst_output

dst_output

根据先前路由查找的结果调用正确的传输例程。请参见第 18 章中的图 18-1。

Invokes the right transmit routine according to the result of a previous routing lookup. See Figure 18-1 in Chapter 18.

ip_finish_output

ip_finish_output

ip_finish_output2

ip_finish_output2

IP层传输例程和相邻子系统之间的接口。请参阅第 21 章中的“与相邻子系统的接口”部分。

Interface between the IP layer transmission routines and the neighboring subsystem. See the section "Interface to the Neighboring Subsystem" in Chapter 21.

ip_decrease_ttl

ip_decrease_ttl

递减 IP 标头的 TTL 字段并相应更新 IP 校验和。

Decrements the IP header's TTL field and updates the IP checksum accordingly.

ip_fast_csum

ip_fast_csum

ip_send_check, ...

ip_send_check, ...

计算或更新 IP 校验和。第 18 章的“用于校验和计算的 API ”部分列出了更多这样的例程。

Compute or update an IP checksum. Many more such routines are listed in the section "APIs for Checksum Computation" in Chapter 18.

in_dev_get

in_dev_get

返回网络设备的 IP 配置块 in_device 并增加其引用计数。

Returns the IP configuration block in_device of a network device and increments its reference count.

inet_initpeers

inet_initpeers

初始化 IP 对等子系统。

Initializes the IP peer subsystem.

inet_getpeer

inet_getpeer

使用 IPv4 地址作为键来搜索 inet_peer 结构。

Searches an inet_peer structure using an IPv4 address as a key.

ip_select_ident

ip_select_ident

ip_select_ident_more

ip_select_ident_more

secure_ip_id

secure_ip_id

选择用于出口 IP 数据包的 IP ID。

Select the IP ID to use for an egress IP packet.

ip_call_ra_chain

ip_call_ra_chain

将携带路由器警报选项的入口 IP 数据包传递给感兴趣的本地原始套接字。请参阅第 20 章中的“ip_forward 函数”一节。

Hands ingress IP packets that carry the Router Alert option to the interested local Raw sockets. See the section "ip_forward function" in Chapter 20.

IP_INC_STATS

IP_INC_STATS

IP_INC_STATS_BH

IP_INC_STATS_BH

IP_INC_STATS_USER

IP_INC_STATS_USER

递增用于保存 IP 流量统计信息的计数器。请参阅“IP 统计”部分。

Increment counters used to keep statistics on IP traffic. See the section "IP Statistics."

inet_rtm_newaddr

inet_rtm_newaddr

inet_rtm_deladdr

inet_rtm_deladdr

inet_dump_ifaddr

inet_dump_ifaddr

处理来自用户空间 IPROUTE2 包的ip addr命令。

Process ip addr commands from the user-space IPROUTE2 package.

inet_alloc_ifa

inet_alloc_ifa

inet_free_ifa

inet_free_ifa

inet_insert_ifa

inet_insert_ifa

inet_del_ifa

inet_del_ifa

inet_set_ifa

inet_set_ifa

inet_select_addr

inet_select_addr

inet_make_mask

inet_make_mask

inet_mask_len

inet_mask_len

inet_ifa_match

inet_ifa_match

 

 

添加、删除和操作本地设备上配置的 IP 地址。请参阅“操作 IP 地址和配置的主要函数”部分。

Add, remove, and manipulate the IP addresses configured on the local devices. See the section "Main functions that manipulate IP addresses and configuration."

for_primary_ifa

for_primary_ifa

for_ifa

for_ifa

浏览网络设备上配置的 IP 地址。

Browse the IP addresses configured on a network device.

rtmsg_ifa

rtmsg_ifa

生成有关本地设备 IP 地址配置更改的通知。请参阅“更改通知:rtmsg_ifa ”部分。

Generates notifications about changes to the IP address configuration of local devices. See the section "Change notification: rtmsg_ifa."

变量

Variables

 

ipv4_devconf

ipv4_devconf

ipv4_devconf_dflt

ipv4_devconf_dflt

存储一组可通过/proc文件系统针对每个设备进行调整的参数 。请参阅“通过 /proc 文件系统进行调整”部分。

Store a set of parameters that can be tuned on a per-device basis via the /proc filesystem. See the section "Tuning via /proc filesystem."

ip_frag_mem

ip_frag_mem

入口 IP 片段所持有的内存量。请参阅第 33 章中的“垃圾收集”部分。

Amount of memory held by ingress IP fragments. See the section "Garbage Collection" in Chapter 33.

ipfrag_lock

ipfrag_lock

用于 ipq 实例表的锁。请参阅第 22 章中的“IP 片段哈希表的组织”一节。

Lock used for the table of ipq instances. See the section "Organization of the IP Fragments Hash Table" in Chapter 22.

peer_total

peer_total

inet_peer_threshold

inet_peer_threshold

peer_total 是现存 inet_peer 结构的数量,inet_peer_threshold 是可用于分配 inet_peer 实例的最大内存量。

peer_total is the number of outstanding inet_peer structures, and inet_peer_threshold is the maximum amount of memory that can be used to allocate inet_peer instances.

peer_pool_lock

peer_pool_lock

用于插入 inet_peer 结构的 AVL 树的锁。

Lock used for the AVL tree where inet_peer structures are inserted.

inet_peer_unused_lock

inet_peer_unused_lock

用于插入未使用的 inet_peer 结构的列表的锁。

Lock used for the list where unused inet_peer structures are inserted.

ip_statistics

ip_statistics

存储有关 IP 流量的统计信息。请参阅“ IP 统计”部分。

Stores statistics about IP traffic. See the section "IP Statistics."

数据结构

Data structures

 

struct iphdr

struct iphdr

struct ip_options

struct ip_options

struct ipcm_cookie

struct ipcm_cookie

struct ipq

struct ipq

struct ip_mib

struct ip_mib

struct inet_peer

struct inet_peer

struct in_device

struct in_device

struct ipv4_devconf

struct ipv4_devconf

struct ipv4_config

struct ipv4_config

struct in_ifaddr

struct in_ifaddr

struct cork

struct cork

IPv4 使用的主要数据结构。第 19 章对它们进行了简要介绍,本章将对其进行详细描述。

Main data structures used by IPv4. They are briefly introduced in Chapter 19 and are described in detail in this chapter.

本书这一部分介绍的文件和目录

Files and Directories Featured in This Part of the Book

net/ipv4 目录包含的文件比图 23-4 中列出的要多,但其余文件在其他章节中介绍,包括构成第 VI 部分和第 VII 部分的章节。

The net/ipv4 directory contains more files than the ones listed in Figure 23-4, but they are covered in other chapters, including the chapters comprising Parts VI and VII.

本书这一部分中的文件和目录

图 23-4。本书这一部分中的文件和目录

Figure 23-4. Files and directories featured in this part of the book




[ * ] net/ipv4/inetpeer.c顶部的注释 非常清楚且不言自明。

[*] The comment at the top of net/ipv4/inetpeer.c is quite clear and self-explanatory.

[ * ]根据源代码中的注释,该问题与 RFC 1144 中描述的 TCP/IP 标头压缩算法的实现有关。

[*] According to the comment in the source code, the issue has to do with the implementation of the TCP/IP header compression algorithm described in RFC 1144.

[ * ] MIB,如前所述,代表管理信息库,用于指代对象(通常是计数器)的集合。

[*] MIB, as mentioned earlier, stands for Management Information Base, and is used to refer to a collection of objects (typically counters).

[ * ] IPv6 将“隧道封装限制”定义为嵌套封装的最大数量。请参阅 RFC 2473 的第 6.6 节。

[*] IPv6 defines the "tunnel encapsulation limit" as the maximum number of nested encapsulations. See section 6.6 of RFC 2473.

[ * ]如果您想了解什么是 NAT 友好的协议或应用程序,您可以阅读 RFC 3235。

[*] You can read RFC 3235 if you would like to see what is considered a NAT-friendly protocol or application.

[ * ] IPv6 在 include/linux/ipv6.h 中定义了自己的 cork 版本。

[*] IPv6 defines its own version of cork in include/linux/ipv6.h.

第 24 章第四层协议和原始 IP 处理

Chapter 24. Layer Four Protocol and Raw IP Handling

本章介绍 L3 和 L4 协议之间的接口。这里考虑的唯一 L3 协议是 IP。L4 协议包括熟悉的 TCP、UDP 和 ICMP,以及其他几个协议。由于篇幅和复杂性的原因,本书没有讨论 L4 协议。然而,本章解释了当应用程序通过原始 IP 处理自己的 L4(有时是 L3)处理时会发生什么。

This chapter describes the interface between L3 and L4 protocols. The only L3 protocol considered here is IP. The L4 protocols include the familiar TCP, UDP, and ICMP, along with several other ones. The L4 protocols are not covered in this book for reasons of space and complexity. However, this chapter explains what happens when applications handle their own L4 (and sometimes L3) processing through raw IP.

本章特别解释:

In particular, this chapter explains:

  • L4 协议如何向内核注册并告诉内核它们对哪种流量感兴趣

  • How L4 protocols register with the kernel and tell the kernel what kind of traffic they are interested in

  • 入口数据包如何传递到正确的 L4 协议处理程序

  • How ingress packets are passed to the correct L4 protocol handler

  • 应用程序如何告诉内核让应用程序处理标头

  • How applications tell the kernel to let the application process headers

我们在第 21 章中看到了L4 协议用于传输 IP 数据报的函数。由于本书重点关注 IP,因此本章仅涵盖 IP 之上的 L4 协议。本章介绍 IPv4 接口,然后简要说明 IPv6 的不同之处。

We saw in Chapter 21 the functions that L4 protocols use to transmit an IP datagram. Since this book focuses on IP, this chapter covers only those L4 protocols that sit on top of IP. The chapter describes the IPv4 interface and then briefly shows where IPv6 differs.

可用的 L4 协议

Available L4 Protocols

一些关键的 L4 协议被静态编译到内核中。一些不太常见的协议可以编译为模块。表 24-1显示了静态编译的协议。

A few key L4 protocols are statically compiled into the kernel. Several less-common protocols can be compiled as modules. Table 24-1 shows the protocols that are statically compiled in.

表 24-1。协议静态编译到内核中

Table 24-1. Protocols statically compiled into the kernel

协议

Protocol

RFC#(年份)

RFC# (Year)

UDP协议

UDP

768(1980)

768(1980)

ICMP

ICMP

792(1981)

792(1981)

传输控制协议

TCP

793(1981)

793(1981)

表24-2列出了第二类中的一些协议。可以从内核配置中的“网络支持→网络选项”部分将它们添加到内核中。

Table 24-2 lists some of the protocols in the second category. They can be added to the kernel from the section "Networking Support → Networking Options" in the kernel configuration.

表24-2。作为模块实现的协议

Table 24-2. Protocols implemented as modules

协议

Protocol

RFC#(年份)

RFC# (Year)

互联网组管理协议 (IGMP)

Internet Group Management Protocol (IGMP)

版本1:1112(1989)

Version 1: 1112(1989)

版本 2:2236(1997)

Version 2: 2236(1997)

版本 3:3376(2002)

Version 3: 3376(2002)

流控制传输协议 (SCTP)

Stream Control Transmission Protocol (SCTP)

2960(2000)

2960(2000)

协议独立组播,版本 1 (PIMv1)和版本 2 (PIMv2)

Protocol Independent Multicast, version 1 (PIMv1) and version 2 (PIMv2)

2362(1998)

2362(1998)

IPsec 套件:IP 身份验证标头协议 (AH), IP 封装安全有效负载协议 (ESP) , IP 有效负载压缩协议 (IPcomp)

IPsec suite: IP Authentication Header Protocol (AH) , IP Encapsulating Security Payload Protocol (ESP) , IP Payload Compression Protocol (IPcomp)

AH:2402(1998)

AH: 2402(1998)

ESP:2406(1998)

ESP: 2406(1998)

IPcomp:3173(2001)

IPcomp: 3173(2001)

通用路由封装 (GRE)

Generic Routing Encapsulation (GRE)

2784(2000)

2784(2000)

IPv4 over IPv4 隧道 (IPIP)

IPv4-over-IPv4 tunnels (IPIP)

1853(1995)

1853(1995)

IPv6 之上的 IPv6

IPv6 over IPv6

2473(1998)

2473(1998)

简单互联网过渡(IPv6-over-IPv4 隧道,SIT)

Simple Internet Transition (IPv6-over-IPv4 tunnel, SIT)

1933(1996)

1933(1996)

其他协议可用于 Linux 内核,但要么在用户空间中实现(例如路由协议),要么作为内核补丁提供,因为它们尚未集成到核心内核中。

Other protocols are available for the Linux kernel but are either implemented in user space (routing protocols are an example) or are available as kernel patches because they are not yet integrated into the core kernel.

图 24-1显示了 L4 协议如何依赖于 L3 协议。三个主要协议(ICMP、UDP 和 TCP)以及 IPsec 套件都有 IPv6 对应协议。图 24-1中没有 IGMPv6 ,因为它的功能是作为 ICMPv6 的一部分实现的。

Figure 24-1 shows how the L4 protocols rest on L3 protocols. The three main protocols (ICMP, UDP, and TCP), as well as the IPsec suite, have IPv6 counterparts. There is no IGMPv6 in Figure 24-1 because its functionality is implemented as part of ICMPv6.

Linux 内核中实现的 IPv4 和 IPv6 之上的 L4 协议

图 24-1。Linux 内核中实现的 IPv4 和 IPv6 之上的 L4 协议

Figure 24-1. L4 protocols on top of IPv4 and IPv6 that are implemented in the Linux kernel

注意表24-2中的最后四项是隧道协议。它们的 ID 标识 L3 协议。例如,IPIP 协议用于在 IPv4 数据报内传输 IPv4 数据报。请注意,封装 IP 数据报时分配给 IPv4 标头协议字段的值与当以太网有效负载是 IP 数据报时用于初始化以太网标头协议字段的值无关。尽管这两个字段引用相同的协议(IPv4),但它们属于两个不同的域:一个是L3协议标识符,而另一个是L4协议标识符。

Note that the last four items in Table 24-2 are tunneling protocols . Their IDs identify an L3 protocol. For example, the IPIP protocol is used to transport IPv4 datagrams inside IPv4 datagrams. Note that the value assigned to the protocol field of the IPv4 header when it encapsulates an IP datagram has nothing to do with the value used to initialize the protocol field of an Ethernet header when the Ethernet payload is an IP datagram. Even though the two fields refer to the same protocol (IPv4), they belong to two different domains: one is an L3 protocol identifier, whereas the other is an L4 protocol identifier.

L4协议注册

L4 Protocol Registration

基于 IPv4 的 L4 协议由net_protocol数据结构定义,定义在include/net/protocol.h中,由以下三个字段组成:

The L4 protocols that rest on IPv4 are defined by net_protocol data structures, defined in include/net/protocol.h, which consist of the following three fields:

int (*handler)(struct sk_buff *skb)
int (*handler)(struct sk_buff *skb)

由协议注册为传入数据包处理程序的函数。这将在“ L3 到 L4 传送:ip_local_deliver_finish ”部分中进一步讨论。可以让协议为 IPv4 和 IPv6 共享相同的处理程序(例如 SCTP)。

Function registered by the protocol as the handler for incoming packets. This is discussed further in the section "L3 to L4 Delivery: ip_local_deliver_finish." It is possible to have protocols that share the same handler for both IPv4 and IPv6 (e.g., SCTP).

void (*err_handler)(struct sk_buff *skb, u32 info)
void (*err_handler)(struct sk_buff *skb, u32 info)

ICMP 协议处理程序用来通知 L4 协议收到了 ICMP UNREACHABLE 消息的函数。我们将在第 35 章看到 Linux 系统何时生成 ICMP UNREACHABLE 消息,并在第 25 章看到 ICMP 协议如何使用 err_handler。

Function used by the ICMP protocol handler to inform the L4 protocol about the reception of an ICMP UNREACHABLE message. We will see in Chapter 35 when a Linux system generates ICMP UNREACHABLE messages, and we will see in Chapter 25 how the ICMP protocol uses err_handler.

int no_policy
int no_policy

该字段在网络堆栈中的某些关键点被查询,用于使协议免于 IPsec 策略检查:1 表示不需要为该协议检查 IPsec 策略。不要将 net_protocol 结构中的 no_policy 字段与 ipv4_devconf 结构中同名的字段混淆:前者适用于协议,后者适用于设备。有关 no_policy 如何使用,请参阅“L3 到 L4 传送:ip_local_deliver_finish”和“IPsec”两节。

This field is consulted at certain key points in the network stack and is used to exempt protocols from IPsec policy checks: 1 means that there is no need to check the IPsec policies for the protocol. Do not confuse the no_policy field of the net_protocol structure with the field bearing the same name in the ipv4_devconf structure: the former applies to a protocol; the latter applies to a device. See the sections "L3 to L4 Delivery: ip_local_deliver_finish" and "IPsec" for how no_policy is used.

include/linux/in.h 文件包含定义为 IPPROTO_XXX 符号的 L4 协议列表。(更完整的列表可参见 /etc/protocols 文件,或 RFC 1700 及其后续 RFC。)L4 协议标识符的最大值为 2^8 - 1 即 255,因为 IP 标头中用于指定 L4 协议的字段是 8 位。最大的数字 255 保留给原始 IP,即 IPPROTO_RAW。

The include/linux/in.h file contains a list of L4 protocols defined as IPPROTO_XXX symbols. (For a more complete list, see the /etc/protocols file, or RFC 1700 and its successor RFCs.) The maximum value for an L4 protocol identifier is 2^8 - 1, or 255, because the field in the IP header allocated to specify the L4 protocol is 8 bits. The highest number, 255, is reserved for Raw IP, IPPROTO_RAW.

并非符号列表中定义的所有协议都在内核层处理;其中一些(特别是资源预留协议,或 RSVP,以及各种路由协议)通常在用户空间中处理。例如,这就是为什么 RSVP 和 OSPF 等路由协议未包含在上一节中内核支持的 L4 协议列表中的原因。

Not all of the protocols defined in the list of symbols are handled at the kernel layer; some of them (notably the Resource Reservation Protocol, or RSVP, and the various routing protocols) are usually handled in user space. This is, for example, why RSVP and routing protocols such as OSPF are not included in the list of kernel-supported L4 protocols given in the previous section.

注册:inet_add_protocol 和 inet_del_protocol

Registration: inet_add_protocol and inet_del_protocol

协议通过 inet_add_protocol 函数注册自己;当以模块实现时,通过 inet_del_protocol 函数注销自己。这两个例程均在 net/ipv4/protocol.c 中定义。

Protocols register themselves with the inet_add_protocol function and, when implemented as modules, unregister themselves with the inet_del_protocol function. Both routines are defined in net/ipv4/protocol.c.

所有向内核注册的 L4 协议的 net_protocol 结构都被插入到一个名为 inet_protos 的表中,如图 24-2 所示。在早期版本的内核中,这是一个哈希表,处理该表的代码中仍然出现 hash 一词,但目前它是一个简单的平面数组,256 个可能的协议各占一项。/etc/protocols 中的协议号就是协议在表中插入的位置。如果您想了解 2.4 内核中该表是如何作为哈希表处理的,请查看 2.4 源代码中的 ip_run_ipprot 函数。图 24-2 显示了最常见协议的编号和缩写;例如,ICMP 是协议 1,占用 inet_protos 表中的槽位 1。

All of the net_protocol structures of the L4 protocols registered with the kernel are inserted into a table named inet_protos, represented in Figure 24-2. In earlier versions of the kernel, this was a hash table, and the word hash still appears in the code that handles the table, but currently it is a simple flat array with one item for each of the possible 256 protocols. The protocol number from /etc/protocols is the slot in the table where the protocol is inserted. If you'd like to see how the table was handled as a hash table in the 2.4 kernel, look in the 2.4 sources at the ip_run_ipprot function. Figure 24-2 shows the numbers and initials of the most common protocols; for instance, ICMP is protocol 1 and occupies slot 1 in the inet_protos table.

IPv4协议表

图 24-2。IPv4协议表

Figure 24-2. IPv4 protocol table

对表的并发访问inet_protos是这样管理的:

Concurrent accesses to the inet_protos table are managed in this way:

  • 读写访问(即 inet_add_protocol 和 inet_del_protocol)通过自旋锁 inet_proto_lock 进行序列化。

  • Read-write accesses (i.e., inet_add_protocol and inet_del_protocol) are serialized with the inet_proto_lock spin lock.

  • 只读访问(即 ip_local_deliver_finish;参见下一节)由 rcu_read_lock/rcu_read_unlock 保护。

  • Read-only accesses (i.e., ip_local_deliver_finish; see the next section) are protected with rcu_read_lock/rcu_read_unlock.

inet_del_protocol 可能会删除表中某个当前正被 RCU 读取者持有的条目,因此它调用 synchronize_net,在返回之前等待所有正在执行的 RCU 读取者完成其临界区。基于 IPv6 的协议使用另一个哈希表。请注意,IPv6 也出现在 IPv4 的 inet_protos 表中:内核可以通过 IPv4 建立 IPv6 隧道(也称为 SIT,即 Simple Internet Transition,简单互联网过渡)。请参阅“IPv6 与 IPv4”一节。

inet_del_protocol, which may remove an entry of the table currently held by an RCU reader, calls synchronize_net to wait for all the currently executing RCU readers to complete their critical section before returning. There is another hash table used by protocols that rest on IPv6. Note that IPv6 appears in the IPv4 inet_protos table as well: the kernel can tunnel IPv6 over IPv4 (also called SIT, for Simple Internet Transition). See the section "IPv6 Versus IPv4."

如上一节所述,ICMP、UDP 和 TCP 协议始终是内核的一部分,因此在启动时由 net/ipv4/af_inet.c 中的 inet_init 静态添加到哈希表中。以下摘录显示了它们结构的定义,以及注册它们的实际 inet_add_protocol 调用:

As mentioned in the previous section, the ICMP, UDP, and TCP protocols are always part of the kernel and therefore are statically added to the hash table at boot time by inet_init in net/ipv4/af_inet.c. The following excerpts show the definitions of their structures and the actual inet_add_protocol calls that register them:

#ifdef CONFIG_IP_MULTICAST
static struct net_protocol igmp_protocol = {
    .handler =    igmp_rcv,
};
#endif

static struct net_protocol tcp_protocol = {
    .handler =      tcp_v4_rcv,
    .err_handler =  tcp_v4_err,
    .no_policy =    1,
};

static struct net_protocol udp_protocol = {
    .handler =      udp_rcv,
    .err_handler =  udp_err,
    .no_policy =    1,
};

static struct net_protocol icmp_protocol = {
    .handler =    icmp_rcv,
};

static int _ _init inet_init(void)
{
...

    /*
     *    添加所有基本协议。
     */

    if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
        printk(KERN_CRIT "inet_init: Cannot add ICMP protocol\n");
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
        printk(KERN_CRIT "inet_init: Cannot add UDP protocol\n");
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        printk(KERN_CRIT "inet_init: Cannot add TCP protocol\n");
#ifdef CONFIG_IP_MULTICAST
    if (inet_add_protocol(&igmp_protocol, IPPROTO_IGMP) < 0)
        printk(KERN_CRIT "inet_init: Cannot add IGMP protocol\n");
#endif
...
}
#ifdef CONFIG_IP_MULTICAST
static struct net_protocol igmp_protocol = {
    .handler =    igmp_rcv,
};
#endif

static struct net_protocol tcp_protocol = {
    .handler =      tcp_v4_rcv,
    .err_handler =  tcp_v4_err,
    .no_policy =    1,
};

static struct net_protocol udp_protocol = {
    .handler =      udp_rcv,
    .err_handler =  udp_err,
    .no_policy =    1,
};

static struct net_protocol icmp_protocol = {
    .handler =    icmp_rcv,
};

static int _ _init inet_init(void)
{
...

    /*
     *    Add all the base protocols.
     */

    if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
        printk(KERN_CRIT "inet_init: Cannot add ICMP protocol\n");
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
        printk(KERN_CRIT "inet_init: Cannot add UDP protocol\n");
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        printk(KERN_CRIT "inet_init: Cannot add TCP protocol\n");
#ifdef CONFIG_IP_MULTICAST
    if (inet_add_protocol(&igmp_protocol, IPPROTO_IGMP) < 0)
        printk(KERN_CRIT "inet_init: Cannot add IGMP protocol\n");
#endif
...
}

仅当内核编译为支持 IP 多播时,才会注册 IGMP 处理程序。

The IGMP handler is registered only when the kernel is compiled with support for IP multicast.

作为如何动态注册其他协议的示例,以下快照取自 Zebra 用户空间路由守护程序对开放最短路径优先 IGP (OSPFIGP) 协议的实现。该代码取自Zebra 软件包中的ospfd/ospf_network.c文件。socket调用有效地将用户空间守护程序注册到内核,为内核提供了发送使用第三个参数中指定的协议的入口数据包的位置。该协议是IPPROTO_OSPFIGP,一个等于 89 的符号,即 /etc/protocols中分配给 OSPFIGP 的编号。另请注意,套接字类型为 SOCK_RAW,因为数据包具有 OSPFIGP 协议知道如何处理的私有格式。原始套接字的使用将在稍后的“原始套接字和原始 IP ”部分中进行描述。

As an example of how other protocols are dynamically registered, the following snapshot is taken from the Zebra user-space routing daemon's implementation of the Open Shortest Path First IGP (OSPFIGP) protocol . The code is taken from the ospfd/ospf_network.c file in the Zebra package. The socket call effectively registers the user-space daemon with the kernel, giving the kernel a place to send ingress packets that use the protocol specified in the third argument. This protocol is IPPROTO_OSPFIGP, a symbol equal to 89, the number assigned to OSPFIGP in /etc/protocols. Note also that the socket type is SOCK_RAW, because packets have a private format that the OSPFIGP protocol knows how to handle. The use of raw sockets is described later in the section "Raw Sockets and Raw IP."

int
ospf_serv_sock (struct interface *ifp, int family)
{
  int ospf_sock;
  int ret, tos;
  struct ospf_interface *oi;

  ospf_sock = socket (family, SOCK_RAW, IPPROTO_OSPFIGP);
  if (ospf_sock < 0)
  {
    zlog_warn ("ospf_serv_sock: socket: %s", strerror (errno));
    return ospf_sock;
  }
  ... ... ...
}
int
ospf_serv_sock (struct interface *ifp, int family)
{
  int ospf_sock;
  int ret, tos;
  struct ospf_interface *oi;

  ospf_sock = socket (family, SOCK_RAW, IPPROTO_OSPFIGP);
  if (ospf_sock < 0)
  {
    zlog_warn ("ospf_serv_sock: socket: %s", strerror (errno));
    return ospf_sock;
  }
  ... ... ...
}

对于每个 L4 协议,内核空间中只能有一个处理程序(但用户空间中可以存在多个处理程序,如稍后“原始套接字和原始 IP”一节所述)。当调用 inet_add_protocol 为已经有处理程序的 L4 协议再安装处理程序时,它会报错(返回 -1)。

For each L4 protocol there can be only one handler in kernel space (but multiple handlers could be present in user space, as discussed later in the section "Raw Sockets and Raw IP"). inet_add_protocol complains (returns -1) when it is called to install a handler for an L4 protocol that already has one.

L3 到 L4 传送:ip_local_deliver_finish

L3 to L4 Delivery: ip_local_deliver_finish

ip_local_deliver_finish 定义在 net/ipv4/ip_input.c 中,并在第 20 章中做过简要描述。它的主要工作是根据输入 IP 数据包标头的 protocol 字段找到正确的协议处理程序,并将数据包交给该处理程序。与此同时,ip_local_deliver_finish 还需要处理原始 IP,并在配置了安全策略时强制执行它们。后两项任务将在后面的小节中介绍。

The main job of ip_local_deliver_finish, which is defined in net/ipv4/ip_input.c and was briefly described in Chapter 20, is to find the correct protocol handler based on the protocol field of the input IP packet's header and to hand the packet to that handler. At the same time, ip_local_deliver_finish needs to handle raw IP and enforce security policies if they are configured. These latter two tasks are described in later sections.

当然,大多数数据包都与某个 L4 协议相关联。ip_local_deliver_finish 从 IP 标头的 8 位字段(图 24-3 中阴影部分)中提取该协议的编号。如果 inet_protos 表不包含该编号的处理程序(即内核收到了一个从未按上一节所示方式注册自己的 L4 协议的数据包),并且没有原始套接字对该数据包感兴趣,则丢弃数据包,并向发送方发回 ICMP 不可达消息。

Most packets, of course, are associated with an L4 protocol. ip_local_deliver_finish extracts the number of this protocol from the 8-bit field of the IP header shown with shading in Figure 24-3. If the inet_protos table doesn't contain a handler for this number—that is, if the kernel received a packet for an L4 protocol that never registered itself in the manner shown in the previous section—and no raw socket is interested in the packet, the packet is dropped and an ICMP unreachable message is sent back to the sender.

然而,除了内核处理之外,应用程序还可以处理数据包。这种处理可以代替内核处理或在内核处理之外进行。因此,无论是否注册了内核处理程序,该ip_local_deliver_finish函数都会检查应用程序是否已设置原始套接字来处理协议,如果是,则克隆数据包以将其移交给应用程序。如图 24-4所示。最后,无论数据包是通过注册的 L4 协议还是原始 IP 进行处理,都可能必须调用其他协议(例如 IPsec 套件中的协议)。

In addition to kernel handling, however, applications can also handle packets. This handling can be done instead of the kernel handling or in addition to it. Therefore, regardless of whether a kernel handler is registered, the ip_local_deliver_finish function always checks whether an application has set up a raw socket to handle the protocol and, if so, makes a clone of the packet to hand over to the application. This is depicted in Figure 24-4. Finally, whether the packet is processed through a registered L4 protocol or Raw IP, other protocols such as those in the IPsec suite might have to be invoked.

IPv4 标头和协议标识符字段

图 24-3。IPv4 标头和协议标识符字段

Figure 24-3. IPv4 header and protocol identifier field

图 24-4显示了该函数的基本操作。

Figure 24-4 shows the basic operation of the function.

该函数开始如下:

The function starts as follows:

static inline int ip_local_deliver_finish(struct sk_buff *skb)
{
    int ihl = skb->nh.iph->ihl*4;
    _ _skb_pull(skb, ihl);
    skb->h.raw = skb->data;
static inline int ip_local_deliver_finish(struct sk_buff *skb)
{
    int ihl = skb->nh.iph->ihl*4;
    _ _skb_pull(skb, ihl);
    skb->h.raw = skb->data;

skb->h 在 netif_receive_skb(第 10 章中描述)中被初始化为指向 IP 标头的开头。此时,内核不再需要 IP 标头,因为 IP 层的处理已经完成,数据包将被传递到上一层。因此,此处所示的 _ _skb_pull 调用缩短了数据包的数据部分以跳过 L3 标头,随后的语句将 skb 中的指针更新为 skb->data 指针的值,该指针指向 L4 标头的开头。

skb->h was initialized in netif_receive_skb (described in Chapter 10) to point to the beginning of the IP header. At this point, the kernel no longer needs the IP header because it is finished with the IP layer and is delivering the packet to the next higher layer. Therefore, the _ _skb_pull call shown here shortens the data portion of the packet to ignore the L3 header, and the following statement updates the pointer in skb to the value of the skb->data pointer, which points to the beginning of the L4 header.

ip_local_deliver_finish 函数

图 24-4。ip_local_deliver_finish 函数

Figure 24-4. ip_local_deliver_finish function

协议 ID 是从skb->nh.iph->protocol变量中提取的,该变量指向 IP 标头的协议字段,如图 24-3中的阴影所示。

The protocol ID is extracted from the skb->nh.iph->protocol variable, which points to the protocol field of the IP header, shaded in Figure 24-3.

图 24-4 显示 ip_local_deliver_finish 可以调用多个协议处理程序(ipprot->handler)。有人可能会问这是如何发生的,因为如图 24-3 所示,每个数据包标头只有列出一个 L4 协议的空间。调用多个 L4 协议的一个例子是 IPsec 的使用。使用 IPsec 时,内核需要在将数据包交给真正的 L4 协议之前,先处理可能存在的 AH、ESP 和 IPcomp 标头。图 24-5 显示了 IPsec 套件各协议使用的标头和标尾所在的位置。该图还显示,ip_local_deliver_finish 在几个地方通过 xfrm4_policy_check 查询 IPsec 安全策略。由于本书不讨论 IPsec,我们假设主机上没有 IPsec 配置,因此对 xfrm4_policy_check 的两次调用都成功。

Figure 24-4 shows that ip_local_deliver_finish may invoke more than one protocol handler (ipprot->handler). One might ask how this could happen, because, as shown in Figure 24-3, each packet header has space to list only one L4 protocol. An example where multiple L4 protocols are invoked is the use of IPsec. With IPsec, the kernel needs to process possible AH, ESP, and IPcomp headers before handing the packet to the real L4 protocol. Figure 24-5 shows where the headers and trailers used by the protocols of the IPsec suite sit. The figure also shows that ip_local_deliver_finish consults the IPsec security policies with xfrm4_policy_check in a couple of places. Because IPsec is not discussed in this book, let's just assume there is no IPsec configuration on the host and therefore that both calls to xfrm4_policy_check succeed.

IPsec 标头/尾部位置

图 24-5。IPsec 标头/尾部位置

Figure 24-5. IPsec headers/trailers locations

请注意,在图 24-4 中,ip_local_deliver_finish 在协议处理程序成功处理之后不会释放缓冲区:由协议处理程序负责释放它。

Note in Figure 24-4 that ip_local_deliver_finish does not free the buffer after successful processing by the protocol handler: the protocol handler takes care of it.

原始套接字和原始 IP

Raw Sockets and Raw IP

并非所有 L4 协议都是在内核空间中实现的。例如,应用程序可以使用原始套接字,如 Zebra 代码前面所示,绕过内核空间中的 L4。使用原始套接字时,应用程序向内核提供已包含所有必要的 L4 信息的 IP 数据包。这使得既可以在用户空间中实现新的 L4 协议,又可以在用户空间中对通常在内核空间中处理的 L4 协议进行额外的处理。因此,一些L4协议完全在内核空间中实现(例如TCP和UDP),一些完全在用户空间中实现(例如OSPF),以及一些部分在内核空间中部分在用户空间中实现(例如ICMP)。图24-6(a) (b)(c)显示了这三种情况,图24-6(d)是图24-6(b)的特殊情况。以下是对图中发生的情况的解释:

Not all the L4 protocols are implemented in kernel space. For instance, an application can use raw sockets , as shown earlier in the Zebra code, to bypass L4 in kernel space. When using raw sockets, the applications supply the kernel with IP packets that already include all the necessary L4 information. This makes it possible both to implement new L4 protocols in user space and to do extra processing in user space on those L4 protocols normally processed in kernel space. Some L4 protocols, therefore, are implemented entirely in kernel space (e.g., TCP and UDP), some entirely in user space (e.g., OSPF), and some partially in kernel space and partially in user space (e.g., ICMP). Figure 24-6(a)(b)(c) shows the three cases, and Figure 24-6(d) is a special case of Figure 24-6(b). Here is an explanation of what's going on in the figure:

  • (a) 网络浏览器与远程网络服务器通信。在这种情况下,通信是通过一个或多个 TCP 套接字完成的。TCP 在内核空间中实现:浏览器和 Web 服务器仅向内核传递 TCP 有效负载,内核负责处理 TCP 和 IP 标头。

  • (a) A web browser communicates with a remote web server. In this case, the communication is done via one or more TCP sockets. TCP is implemented in kernel space: the browser and the web server pass the kernel the TCP payload only, and the kernel takes care of the TCP and IP headers.

  • (b) 两个运行 OSPF 守护进程的路由器相互通信。OSPF 协议在用户空间中实现,并向内核传递 L4 标头。[ * ]这是使用原始套接字的示例。有关原始套接字如何融入堆栈的信息,请参阅第 13 章。

  • (b) Two routers running OSPF daemons talk to each other. The OSPF protocol is implemented in user space, and passes the kernel the L4 header.[*] This is an example of the use of raw sockets. See Chapter 13 for information on how raw sockets fit into the stack.

  • (c) 一台主机 ping 另一台主机。请求组件在用户空间中实现。回复组件在内核空间中实现。

  • (c) One host pings another one. The request component is implemented in user space. The reply component is implemented in kernel space.

  • (d) 主机运行 traceroute 来执行网络故障排除。L3 和 L4 标头均由应用程序处理。它将其 L4 协议简单地指定为 RAW IP,并在套接字上设置 IP_HDRINCL(包含标头)选项。[ ]请参阅第 21 章了解原始 IP 协议是如何由 IP 处理的。

  • (d) A host runs traceroute to perform network troubleshooting. Both the L3 and L4 headers are processed by the application. It specifies its L4 protocol simply as RAW IP and sets the IP_HDRINCL (header included) option on the socket.[] See Chapter 21 for how the raw IP protocol is taken care of by IP.

当来自用户空间框的箭头绕过内核空间中的“L4”框时,这意味着它是原始传输。

When the arrow from the user-space box bypasses the "L4" box in kernel space, it means it is a raw transmission.

ICMP 是部分在用户空间、部分在内核空间实现的协议示例。当您 ping 主机时,ping 应用程序会生成 ICMP 数据包,并将它们作为 IP 数据包传递给内核。内核不会触及 ICMP 标头。然而,接收主机在内核空间中处理 ICMP_ECHO_REQUEST,并回复一条 ICMP_ECHO_REPLY 消息。

ICMP is an example of a protocol that is implemented partially in user space and partially in kernel space. When you ping a host, the ping application generates ICMP packets and passes them to the kernel as IP packets. The kernel does not touch the ICMP header. However, the receiving host processes ICMP_ECHO_REQUEST in kernel space by replying back with an ICMP_ECHO_REPLY message.

将原始输入数据报传送到接收方应用程序

Delivering Raw Input Datagrams to the Recipient Application

在学习编程的时候,你可能接触过 socket 调用。我们将在这里回顾它,以展示它与原始协议的关系。当应用程序打开套接字时,调用需要指定地址族、套接字类型和协议标识符。套接字和协议都可以是原始(raw)类型。我们来看看两者之间的关系。这是 socket 系统调用的原型:

When learning programming, you were probably exposed to the socket call. We'll review it here to show its relation to raw protocols. When an application opens a socket, the call needs to specify the family, socket type, and protocol identifier. Both the socket and the protocol can be of type raw. Let's see the relationship between the two. This is the prototype of the socket system call:

socket(int family, int type, int protocol)

family 是地址族;允许的值 AF_XXX 列在 include/linux/socket.h 中(用于 TCP/IP 的值为 AF_INET)。type 是套接字类型;允许的值 SOCK_XXX 列在 include/linux/net.h 中。protocol 是 L4 协议标识符;IP 协议的允许值 IPPROTO_XXX 列在 include/linux/in.h 中。

family is the address family; the allowed values AF_XXX are listed in include/linux/socket.h (the value used for TCP/IP is AF_INET). type is the socket type; the allowed values SOCK_XXX are listed in include/linux/net.h. protocol is the L4 protocol identifier; the allowed values IPPROTO_XXX of IP protocols are listed in include/linux/in.h.

协议的内核与用户空间实现

图 24-6。协议的内核与用户空间实现

Figure 24-6. Kernel versus user-space implementations of protocols

当您打开类型为 SOCK_RAW 的套接字,并且所选协议被分配了整数值 P 时,您的应用程序将收到符合以下条件的所有入口数据包:

When you open a socket of type SOCK_RAW and any chosen protocol assigned the integer value P, your application will be passed all ingress packets matching the following criteria:

  • IP 标头中的 L4 协议标识符是 P

  • The L4 protocol identifier in the IP header is P.

  • 当套接字绑定到目标 IP 地址时,数据包中的源 IP 地址必须与其匹配。

  • When the socket is bound to a destination IP address, the source IP address in the packets must match it.

  • 当套接字绑定到本地 IP 地址时,数据包中的目标 IP 地址必须与其匹配。

  • When the socket is bound to a local IP address, the destination IP address in the packets must match it.

多个套接字可以满足这些标准,因此单个原始 IP 数据包可以传送到多个应用程序。例如,考虑从两个不同的终端 ping 相同的远程 IP 地址,如图24-7所示。

More than one socket can match these criteria, so a single raw IP packet can be delivered to multiple applications. For instance, think about pinging the same remote IP address from two different terminals, as shown in Figure 24-7.

对同一目标 IP 地址的并发 ping

图 24-7。对同一目标 IP 地址的并发 ping

Figure 24-7. Concurrent pings to the same destination IP address

两个 ping 实例如何区分各自的回复,从而不会被发给另一个实例的流量所混淆?L4 协议必须在其标头或有效负载中包含区分应用程序所需的信息。例如,ping 命令发送的 ICMP ECHO REQUEST 消息将其 ICMP 标头的 identifier 字段初始化为发送者的进程 ID(pid)。该字段使 ping 应用程序能够识别接收方发回的入口 ECHO REPLY ICMP 消息。ICMP 标头的序列号字段被初始化为一个计数器,ping 在每次传输后将其递增。该计数器使 ping 能够将入口 ICMP ECHO REPLY 消息与其关联的 ICMP ECHO REQUEST 消息进行匹配。在下面的示例中,该计数器被打印为 icmp_seq 字段。

How can the two ping instances distinguish the replies so that they are not confused by the traffic meant for the other instance? The L4 protocol must include the information needed to distinguish the applications in its header or payload. For example, the ICMP ECHO REQUEST messages sent by the ping command get their ICMP header's identifier field initialized to the sender's process ID (pid). This field is what will allow the ping application to recognize the input ECHO REPLY ICMP messages that will be sent back by the recipient. The sequence number field of the ICMP header is initialized to a counter that ping increments after each transmission. This counter will allow ping to match ingress ICMP ECHO REPLY messages with their associated ICMP ECHO REQUEST messages. In the example below, this counter is printed as the icmp_seq field.

# ping www.oreilly.com
PING www.oreilly.com (208.201.239.36) 56(84) bytes of data
64 bytes from www.oreillynet.com (208.201.239.36): icmp_seq=0 ttl=50 time=245 ms
64 bytes from www.oreillynet.com (208.201.239.36): icmp_seq=1 ttl=50 time=244 ms
...

有关 ICMP 的更多详细信息,请参阅第 25 章。请注意,将数据包传递给多个应用程序并让应用程序筛选出不需要的数据包(而不是让内核通过端口进行筛选)会产生额外的开销。由于这种开销,需要大量复用/解复用的新协议通常不会在使用原始 IP 的用户空间中实现。

For more details on ICMP, see Chapter 25. Note the extra overhead involved in delivering the packet to multiple applications and having the applications screen out the unwanted packets, instead of having the kernel do the screening through a port. Because of this overhead, new protocols that need heavy multiplexing/demultiplexing are not normally implemented in user space using raw IP.

简而言之,每当内核收到一个携带未由内核处理的 L4 协议的数据包时,所有注册该协议的套接字都会收到该数据包的副本。由他们决定接受或丢弃数据包。这意味着应用程序必须有一种方法来了解它们收到的数据包是否是发给它们的,而端口系统在 TCP 和 UDP 中则不需要执行此任务。

In short, every time the kernel receives a packet that carries an L4 protocol not handled by the kernel, all the sockets that registered for that protocol receive a copy of the packet. It is up to them to accept or discard the packet. This means that the applications must have a way to understand if the packet they receive is addressed to them, a task rendered unnecessary in TCP and UDP by the port system.

原始 IP 适合ping,因为虽然可以在同一台计算机上同时运行几个 ping 实例,但它们通常不会针对相同的目标 IP 地址,并且通常每个实例仅发送几个数据包。同样,OSPF 等路由协议通常作为单个实例在每个主机上运行。

Raw IP is suitable for ping because, while it's possible for a few ping instances to be running at once on the same machine, they normally do not target the same destination IP address and normally send only a few packets each. Similarly, a routing protocol such as OSPF usually runs as a single instance on each host.

当套接字类型为 SOCK_RAW 且协议为 RAW IP (255) 时,意味着应用程序同时处理 L4 标头和 IP 标头。这与前面展示的 Zebra 路由应用程序的不同之处在于,协议是 RAW IP,而不是 OSPF 之类的已知协议。图 24-6(d) 显示了 RAW IP 的情况。此类应用程序在套接字上设置一个名为 IP_HDRINCL(包含标头)的选项,以告诉内核应用程序将处理 IP 标头,因此内核不需要对其执行任何操作。当协议 P 为 RAW IP 时,该套接字上的 IP_HDRINCL 选项默认打开。traceroute 需要使用 IP 标头的 TTL 字段来完成其工作,它是使用 IP_HDRINCL 选项的应用程序的一个示例。

When the socket type is SOCK_RAW and the protocol is RAW IP (255), it means that the application takes care of both the L4 header and the IP header. This differs from the Zebra routing application shown earlier in that the protocol is RAW IP instead of a known protocol such as OSPF. Figure 24-6(d) shows the RAW IP case. Such applications set an option on the socket called IP_HDRINCL (header included) to tell the kernel that the application will take care of the IP header and that the kernel therefore does not need to do anything with it. When protocol P is RAW IP, the IP_HDRINCL option is turned on by default on the socket. traceroute, which needs to play with the TTL field of the IP header to accomplish its job, is an example of an application that uses the IP_HDRINCL option.

当应用程序使用原始 IP 套接字时,它只需要向内核提供协议 ID 和目标 IP 地址(后者将被设置在内核生成的 IP 标头上):内核可以忽略通常在 L4 层使用的所有其他参数和选项。

When an application uses a raw IP socket, it needs to give the kernel only the protocol ID and the destination IP address (which will be set on the IP header that the kernel will generate): the kernel can ignore all the other parameters and options normally used at the L4 layer.

用于存储原始处理程序的表(raw_v4_htable)和用于存储协议处理程序的表(inet_protos)大小相同,因此 ip_local_deliver_finish 使用相同的 hash 值来访问这两个表。(正如前面所说,这个值不再是真正的哈希值。)原始数据包被交给 raw_v4_input。该函数不直接对输入缓冲区进行操作,因为数据包属于调用者(ip_local_deliver_finish),并且可能与许多应用程序共享。因此,raw_v4_input 制作本地副本(克隆),并将它们交给主处理程序 raw_rcv。

The table used to store the raw handlers (raw_v4_htable) and the one used to store the protocol handlers (inet_protos) are of the same size, so ip_local_deliver_finish uses the same value hash to access the two tables. (As I said earlier, this value is no longer an actual hash.) Raw packets are given to raw_v4_input. This function does not operate directly on the input buffer, because the packet belongs to the caller (ip_local_deliver_finish) and may be shared with many applications. Therefore, raw_v4_input makes local copies (clones) and gives them to the main handler, raw_rcv.

IPsec

IPsec

在 ip_local_deliver_finish 将数据包传送给正确的协议处理程序之前,它首先通过 IPsec 检查该数据包是否允许被处理。当由于内核中缺少正确的协议处理程序而需要生成 ICMP 错误消息时,也会执行同样的检查。IPsec 维护一个安全策略数据库,分为入口策略和出口策略。由于 ip_local_deliver_finish 处理的是传入流量,因此调用 IPsec 函数 xfrm4_policy_check 时使用方向标志 XFRM_POLICY_IN。如果允许处理数据包,该函数的返回值为 1;如果不允许,则返回 0。当传输协议的 net_protocol 实例将 no_policy 初始化为 1 时,不会查询安全策略。

Before ip_local_deliver_finish delivers a packet to the right protocol handler, it first checks with IPsec whether the packet is allowed to be processed. The same is done when, due to the absence in the kernel of the right protocol handler, the kernel needs to generate an ICMP error message. IPsec keeps a database of security policies divided into ingress and egress policies. Because ip_local_deliver_finish processes incoming traffic, the IPsec function xfrm4_policy_check is invoked with the direction flag XFRM_POLICY_IN. The return value of this function is 1 if the packet is allowed to be processed, and zero if it is not. Security policies are not consulted when the transport protocol's net_protocol instance has no_policy initialized to 1.

由于篇幅原因,本书没有讨论 IPsec 套件协议的实现。

The implementation of the protocols of the IPsec suite is not discussed in this book for lack of space.

IPv4 与 IPv6

IPv4 Versus IPv6

就 L3 到 L4 协议接口而言,IPv6 与 IPv4 非常相似。L4 协议可以通过 inet6_add_protocol 注册,并通过 inet6_del_protocol 取消注册,两者都在 net/ipv6/protocol.c 中定义。处理程序存储在名为 inet6_protos 的表中,其大小(MAX_INET_PROTOS)与 IPv4 使用的表相同。在 IPv6 之上运行的 L4 协议由 inet6_protocol 数据结构(在 include/net/protocol.h 中定义)表示,其定义与 IPv4 使用的几乎相同。唯一的区别在于 handler 和 err_handler 函数指针的原型,以及使用标志而不是整数来存储诸如是否存在安全策略之类的信息。

IPv6 is very similar to IPv4 as far as the L3 to L4 protocol interface is concerned. L4 protocols can register via inet6_add_protocol and deregister them via inet6_del_protocol, both defined in net/ipv6/protocol.c. Handlers are stored in a table called inet6_protos of the same size (MAX_INET_PROTOS) used by IPv4. L4 protocols that run on top of IPv6 are represented by inet6_protocol data structures (defined in include/net/protocol.h), whose definition is almost identical to the one used by IPv4. The only differences are in the prototypes of the handler and err_handler function pointers and the use of a flag instead of an integer to store such information as the presence of security policies.

IPv6 报头中用于标识上层协议的字段称为 next_header,它与 IPv4 使用的字段一样是一个 8 位值。该字段在报头中的位置参见图 24-8。

The field used by IPv6 to identify the upper-layer protocol in the IPv6 header is called next_header and is an 8-bit value like the one used by IPv4. See Figure 24-8 for the location of the field in the header.

IPv6 标头和 next_header 协议标识符

图 24-8。IPv6 标头和 next_header 协议标识符

Figure 24-8. IPv6 header and next_header protocol identifier

通过 /proc 文件系统进行调整

Tuning via /proc Filesystem

/proc中没有可用于调整 L3 和 L4 之间的接口的文件。

There are no files in /proc that can be used to tune the interface between L3 and L4.

本章介绍的函数和变量

Functions and Variables Featured in This Chapter

表 24-3 总结了本章介绍的函数、变量和数据结构。

Table 24-3 summarizes the functions, variables, and data structures introduced in this chapter.

表 24-3。本章介绍的函数、变量和数据结构

Table 24-3. Functions, variables, and data structures featured in this chapter

名称

Name

描述

Description

函数

Functions

 

inet_add_protocol

inet_add_protocol

inet_del_protocol

inet_del_protocol

注册和取消注册 IP 堆栈的 L4 协议处理程序。

Registers and unregisters an L4 protocol handler for the IP stack.

inet_init

inet_init

AF_INET 协议族的初始化例程。这是注册最常见的 L4 协议的地方。

Initialization routine for the AF_INET protocol family. It is where the most common L4 protocols are registered.

ip_local_deliver_finish

ip_local_deliver_finish

raw_v4_input

raw_v4_input

ip_local_deliver_finish 将入口 IP 流量传送给正确的 L4 协议处理程序,并使用 raw_v4_input 向任何符合条件的 RAW IP 套接字提供副本。

ip_local_deliver_finish delivers ingress IP traffic to the right L4 protocol handlers, and it uses raw_v4_input to give a copy to any eligible RAW IP socket.

变量

Variables

 

inet_protos

inet_protos

IP 堆栈的 L4 协议处理程序表。

Table of L4 protocol handlers for the IP stack.

raw_v4_htable

raw_v4_htable

原始套接字表。

Table of raw sockets.

数据结构

Data structure

 

net_protocol

net_protocol

IP 堆栈的 L4 协议描述符。

L4 protocol descriptor for the IP stack.

本章介绍的文件和目录

Files and Directories Featured in This Chapter

内核用来调用 L4 协议处理程序的代码主要位于两个文件中:include/net/protocol.h 和 net/ipv{4,6}/protocol.c。图 24-9 中阴影较浅的文件是实现 L4 协议的文件。

The code used by the kernel to invoke the L4 protocols handlers is located mainly in two files: include/net/protocol.h and net/ipv{4,6}/protocol.c. The more lightly shaded files in Figure 24-9 are the ones that implement L4 protocols.

本章介绍的文件和目录

图 24-9。本章介绍的文件和目录

Figure 24-9. Files and directories featured in this chapter




[*] 大多数 OSPF 实现也传递 IP 标头。参见图 24-6(d) 所示的情况。

[*] Most implementations of OSPF pass the IP header as well. See the case shown in Figure 24-6(d).

[ ] IP 数据包通过 dst_output 例程传输,该例程在第 21 章中描述。dst_output 在需要时负责第三层到第二层的地址映射;因此,情况 (d) 并不是真正直接调用 L2。

[] The IP packet is transmitted with the dst_output routine, described in Chapter 21. dst_output takes care of the Layer three to Layer two address mapping if needed; therefore, case (d) is not really a direct call to L2.

第 25 章 Internet 控制消息协议 (ICMPv4)

Chapter 25. Internet Control Message Protocol (ICMPv4)

互联网控制消息协议 (ICMP) 是互联网主机用来交换控制消息(特别是错误通知和信息请求)的传输协议。在本章中,我们将了解 ICMPv4,即 IPv4 使用的版本。IPv6 使用 ICMPv6 协议,该协议除 ICMPv4 中的功能外还包括其他功能。

The Internet Control Message Protocol (ICMP) is a transport protocol used by Internet hosts to exchange control messages, notably error notifications and information requests. In this chapter, we will look at ICMPv4, the version used by IPv4. IPv6 uses the ICMPv6 protocol, a protocol that includes other functionalities besides the ones in ICMPv4.

多年来,ICMP 协议越来越多地被用作开发监控和测量应用程序的基础。不幸的是,ICMP 协议也经常被用作安全攻击的基础,例如 DoS 或远程指纹收集。因此,网络管理员经常配置路由器和防火墙来过滤掉大多数 ICMP 消息类型。有时他们过滤太多,违背了 RFC 建议。无论消息是否被过滤,它们通常都会受到速率限制。因此,任何基于 ICMP 构建的应用程序对于测量或监视目的并不总是可靠的。然而,由于测量并不在其最初的设计目标之内,ICMP 通常不允许监控应用程序收集它们需要的所有信息。相反,专门为此目的编写了通常基于 TCP 或 UDP 的应用程序。

Over the years, the ICMP protocol has increasingly been used as the basis for the development of monitoring and measurement applications. Unfortunately, the ICMP protocol is also often used as the basis for security attacks, such as DoS or remote fingerprint collection. For this reason, network administrators often configure routers and firewalls to filter out most ICMP message types. Sometimes they filter too much, going against the RFC recommendations. Regardless of whether messages are filtered, they are often rate limited. It follows that any application built on top of ICMP is not always reliable for measurement or monitoring purposes. However, because measurement was not among its original design goals, ICMP often does not allow monitoring applications to collect all the information they need. Instead, dedicated applications, often based on TCP or UDP, have been written for that purpose.

对于对 ICMP 安全方面感兴趣的读者,我推荐以色列安全顾问 Ofir Arkin 的论文“扫描中的 ICMP 使用”(http://www.sys-security.com/archive/papers/ICMP_Scanning_v3.0.zip) 。它展示了 ICMP 消息如何(并且)用于网络扫描目的,以及为什么大多数消息应该(并且)被网络管理员过滤掉。本文还包括 ICMP 主要 RFC 的详细摘要。

For readers interested in the security aspects of ICMP, I recommend the paper "ICMP Usage in Scanning" from the Israeli security consultant Ofir Arkin (http://www.sys-security.com/archive/papers/ICMP_Scanning_v3.0.zip). It shows how ICMP messages can be (and are) used for network scanning purposes and why most of them should be (and are) therefore filtered out by network administrators. The paper includes a detailed summary of the main RFCs on ICMP as well.

在本章中,我们将了解 Linux 如何实现 ICMP 协议。对于每种 ICMP 消息类型,我们将简要了解内核何时生成它以及内核在接收到它时如何处理它。有关更多详细信息,请参阅以下 RFC:

In this chapter, we'll see how Linux implements the ICMP protocol. For each ICMP message type, we will briefly see when the kernel generates it and how the kernel processes it when it is received. For more details, refer to the following RFCs:

  • RFC 792,互联网控制消息协议

  • RFC 792, Internet Control Message Protocol

  • RFC 950,互联网标准子网划分程序,附录 I

  • RFC 950, Internet Standard Subnetting Procedure, Appendix I

  • RFC 1016,主机可以使用源抑制进行的操作

  • RFC 1016, Something a Host Could Do with Source Quench

  • RFC 1191,路径 MTU 发现

  • RFC 1191, Path MTU Discovery

  • RFC 1122,互联网主机的要求 - 通信层

  • RFC 1122, Requirements for Internet Hosts—Communication Layers

  • RFC 1812,IP 版本 4 路由器的要求

  • RFC 1812, Requirements for IP Version 4 Routers

  • RFC 1256,ICMP 路由器发现消息

  • RFC 1256, ICMP Router Discovery Messages

  • RFC 1349,互联网协议簇中的服务类型

  • RFC 1349, Type of Service in the Internet Protocol Suite

特别是,RFC 792 描述了大多数 ICMP 类型的标头布局,RFC 1122 和 1812 说明主机和路由器是否应生成和处理每种 ICMP 类型。部分信息也包含在本章中。

In particular, RFC 792 describes the layout of the headers of most ICMP types, and RFCs 1122 and 1812 tell whether hosts and routers should generate and process each ICMP type. Part of that information is included in this chapter, too.

有关 ICMP 消息相关 RFC 的详细列表,您还可以查阅此 URL: http: //www.iana.org/assignments/icmp-parameters

For a detailed list of RFCs related to ICMP messages, you can also consult this URL: http://www.iana.org/assignments/icmp-parameters.

ICMP 标头

ICMP Header

图 25-1 显示了 ICMP 报头的结构。

Figure 25-1 shows the structure of the ICMP header .

ICMP 标头

图 25-1。ICMP 标头

Figure 25-1. ICMP header

前三个字段对于所有 ICMP 消息类型都是通用的:

The first three fields are common to all ICMP message types:

type
type

code
code

这一对字段标识 ICMP 消息类型。有时仅凭 type 就足以明确地识别消息,而有时则需要 code 来区分同一消息类型的不同变体。有关详细信息,请参阅"ICMP 类型"一节。

This pair identifies the ICMP message type. Sometimes type alone is sufficient to unequivocally identify the message, and other times code is needed to distinguish between different variants of the same message type. See the section "ICMP Types" for more details.

checksum
checksum

checksum 涵盖 ICMP 标头和有效负载。它使用与其他主要 IP 协议(IP、UDP、TCP 和 IGMP)相同的算法:IP 数据包 16 位字的反码和。有关详细信息,请参阅第 18 章中的"校验和"一节。

checksum covers the ICMP header and the payload. It uses the same algorithm as other major IP protocols (IP, UDP, TCP, and IGMP): the one's complement sum of the 16-bit words of the IP packet. See the section "Checksums" in Chapter 18 for more details.

ICMP 标头后半部分的结构取决于消息类型。由于type和的值code,接收者可以识别消息类型并相应地读取标头的其余部分。接下来的32位可以不使用、完全使用或部分使用,具体取决于消息类型;这三种不同布局的示例如图 25-1底部所示。

The structure of the second half of the ICMP header depends on the message type. Thanks to the value of type and code, the receiver can identify the message type and read the rest of the header accordingly. The following 32 bits can be unused, completely used, or partially used depending on the message type; examples of these three different layouts are shown at the bottom of Figure 25-1.

ICMP 消息分为两类:错误和查询(请求/响应)。在表 25-1中,您可以看到哪些 ICMP 类型属于每个类别。查询消息使用报头的额外 32 位来定义两个字段identifiersequence_number图 25-1(b))。这两个字段由接收方保持不变(即,从请求消息复制到响应消息),并允许源将响应与其原始请求进行匹配。

ICMP messages are classified into two categories: error and query (request/response). In Table 25-1, you can see which ICMP types fall into each category. Query messages use the extra 32 bits of the header to define the two fields identifier and sequence_number (Figure 25-1(b)). These two fields are left unchanged by the receiver (i.e., copied from the request message to the response message) and allow the source to match the response with its original request.

ICMP 错误消息包含有效负载,其内容将在下一节中描述。

ICMP error messages include a payload, whose content is described in the next section.

在 RFC 792 中,您可以找到大多数 ICMP 消息类型标头的布局。

In RFC 792, you can find the layout of most ICMP message types' headers.

ICMP 有效负载

ICMP Payload

当内核在处理入口 IP 数据包时检测到错误情况,就会发送 ICMP 错误消息。所有 ICMP 错误类型在 ICMP 有效负载中都包含相同的信息:触发 ICMP 消息传输的 IP 数据包的 IP 标头,加上一部分 IP 有效负载。生成的 IP 数据包大小不得超过 576 字节,包括外部 IP 标头和 ICMP 标头。(最后这条规则在 RFC 1812 第 4.3.2.3 节中规定,它更新了 RFC 792 的标头定义。根据较早的 RFC 792,ICMP 有效负载只需包含原始 IP 标头加上原始传输标头的 64 位。)

ICMP error messages are sent when the kernel detects an error condition while processing an ingress IP packet. All ICMP error types include the same information in the ICMP payload : the IP header of the IP packet that triggered the transmission of the ICMP message, plus a portion of the IP payload. The resulting IP packet must not exceed 576 bytes in size, including the outer IP header and the ICMP header. (This last rule is stated in RFC 1812, section 4.3.2.3, which updates the header definitions of RFC 792. According to the older RFC 792, the ICMP payload needs to include only the original IP header plus 64 bits of the original transport header.)

图 25-2 显示了根据 RFC 792 的 ICMP_FRAG_NEEDED 错误消息的示例。图 25-2(a) 是触发 ICMP 消息传输的片段,图 25-2(b) 是 ICMP 消息。请注意,ICMP 有效负载还包括原始 IP 标头和一部分传输标头。Linux 符合 RFC 1812,因此包含图 25-2(a) 所示的额外块,最大不超过 576 字节。

Figure 25-2 shows an example of what an ICMP_FRAG_NEEDED error message looks like according to RFC 792. Figure 25-2(a) is the fragment that triggered the transmission of the ICMP message, and Figure 25-2(b) is the ICMP message. Note that the ICMP payload includes the original IP header and a piece of the transport header, too. Linux is compliant with RFC 1812, and therefore includes the extra block shown in Figure 25-2(a), up to a size of 576 bytes.

ICMP 消息的目标将使用原始 IP 标头的协议字段来识别正确的传输协议(示例中为 TCP),而 ICMP 有效负载中的传输标头部分(包括源端口号和目标端口号)使同一目标主机能够识别本地套接字。因此,目标主机在追踪错误原因时将获得一些帮助。

The target of the ICMP message will use the protocol field of the original IP header to identify the right transport protocol (TCP in the example), and the portion of the transport header in the ICMP payload (which includes the source and destination port numbers) will allow that host to identify a local socket. The target host thus gets some help in tracking down the cause of the error.

ICMP_DEST_UNREACH 错误消息的 ICMP 负载示例

图 25-2。ICMP_DEST_UNREACH 错误消息的 ICMP 负载示例

Figure 25-2. Example of ICMP payload for the ICMP_DEST_UNREACH error message

ICMP 类型

ICMP Types

表 25-1 列出了 ICMP 类型和定义它们的 RFC,显示它们通常是由内核还是在用户空间中传输和处理,并将每个类型分类为错误或查询消息。内核符号列在include/linux/icmp.h中。该表仅列出了 Linux 内核关心的 ICMP 消息类型(无论它们是否实现)。您可以参考本章简介中提供的 URL 以获取更新的列表。

Table 25-1 lists the ICMP types and the RFCs where they are defined, shows whether they are generally transmitted and processed by the kernel or in user space, and classifies each as an error or query message. The kernel symbols are listed in include/linux/icmp.h. The table lists only the ICMP message types the Linux kernel cares about (regardless of whether they are implemented). You can refer to the URL provided in the chapter's introduction for an updated list.

表 25-1。ICMP 类型

Table 25-1. ICMP types

类型

Type

名称

Name

发送方

TX by

接收方

RX by

RFC

RFC

错误/查询

Error/Query

a此选项由 RFC 792 的同一作者定义,但未在任何 RFC 中定义。

aThis option was defined by the same author of RFC 792, but it is not defined in any RFC.

0

0

8

8

回显应答

Echo Reply

回显请求

Echo Request

内核

Kernel

用户

User

用户

User

内核

Kernel

792

792

792

792

查询

Query

查询

Query

1

1

2

2

未分配

Not assigned

未分配

Not assigned

    

3

3

目的地无法到达

Destination Unreachable

内核

Kernel

内核

Kernel

792

792

错误

Error

4

4

源抑制

Source Quench

(已过时;请参阅 RFC 1812 第 4.3.3.3 节)

(obsolete; see RFC 1812 section 4.3.3.3)

    

5

5

重定向

Redirect

内核

Kernel

内核

Kernel

792

792

错误

Error

6

6

备用主机地址

Alternate Host Address

(已过时a)

(obsoletea)

    

7

7

未分配

Not assigned

    

9

9

10

10

路由器通告

Router Advertisement

路由器请求

Router Solicitation

用户

User

用户

User

用户

User

用户

User

1256

1256

查询

Query

11

11

超时

Time Exceeded

内核

Kernel

内核

Kernel

792

792

错误

Error

12

12

参数问题

Parameter Problem

内核

Kernel

内核

Kernel

792

792

错误

Error

13

13

时间戳请求

Timestamp Request

用户

User

内核

Kernel

792

792

查询

Query

14

14

时间戳回复

Timestamp Reply

内核

Kernel

用户

User

792

792

查询

Query

15

15

16

16

信息请求

Information Request

信息回复

Information Reply

(已过时;请参阅 RFC 1122 第 3.2.2.7 节和 RFC 1812 第 4.3.3.7 节)

(obsolete; see RFC 1122 section 3.2.2.7 and RFC 1812 section 4.3.3.7)

    

17

17

18

18

地址掩码请求

Address Mask Request

地址掩码回复

Address Mask Reply

内核

Kernel

内核

Kernel

内核

Kernel

内核

Kernel

950

950

查询

Query

ICMP 类型 1、2 和 7 简单地列为未分配,类型 6 未在任何 RFC 中定义。

ICMP types 1, 2, and 7 are simply listed as unassigned, and type 6 is not defined in any RFC.

类型 9 和 10 不在内核空间中处理(因此也没有定义);路由器发现消息由实现 RFC 1256 的应用程序在用户空间中处理。对于 Linux,您可以参考rdisc,它是作为iputils 包的一部分提供的应用程序。

Types 9 and 10 are not handled (and therefore are not defined) in kernel space; the Router discovery messages are processed in user space by applications that implement RFC 1256. For Linux, you can refer to rdisc, which is an application that comes as part of the iputils package.

RFC 1122 和 RFC 1812 分别说明对于主机和路由器而言,每种 ICMP 消息类型的实现是可选的还是强制的。表 25-2 总结了这些要求。对于"must"、"should"和"may"这三个词的准确解释,可以参考 RFC 2119。该表不包含过时的选项。

RFC 1122 and RFC 1812 tell whether the implementation for each ICMP message type is optional or mandatory, for hosts and routers respectively. Table 25-2 summarizes these requirements. For the exact interpretation of the words must, should and may, you can refer to RFC 2119. The table does not include obsolete options.

表 25-2。主机和路由器要求

Table 25-2. Host and router requirements

类型

Type

名称

Name

主机 (RFC 1122)

Hosts (RFC 1122)

路由器 (RFC 1812)

Routers (RFC 1812)

Linux 兼容

Linux is compliant

0

0

8

8

回显应答

Echo Reply

回显请求

Echo Request

必须实现一个回显服务器

Must implement an echo server

必须实现一个回显服务器

Must implement an echo server

是的

Yes

3

3

目的地无法到达

Destination Unreachable

应该发送 必须接收

Should transmit Must receive

必须传送

Must transmit

是的

Yes

5

5

重定向

Redirect

不应发送 必须接收

Should not transmit Must receive

必须发送 可以接收

Must transmit May receive

是的

Yes

9

9

10

10

路由器通告

Router Advertisement

路由器请求

Router Solicitation

不适用

N/A

必须收到

Must receive

否

(用户空间支持)

(Support is available in user space)

11

11

超时

Time Exceeded

必须收到

Must receive

必须传送

Must transmit

是的

Yes

12

12

参数问题

Parameter Problem

应该发送 必须接收

Should transmit Must receive

必须传送

Must transmit

是的

Yes

13

13

14

14

时间戳请求

Timestamp Request

时间戳回复

Timestamp Reply

可能会收到

May receive

可能会收到

May receive

可以接收/发送

May receive/transmit

是的

Yes

17

17

18

18

地址掩码请求

Address Mask Request

地址掩码回复

Address Mask Reply

可以接收/发送

May receive/transmit

必须接收/发送

Must receive/transmit

否

当路由器是触发 ICMP 消息传输的 IP 数据包的发起者时,它必须遵守主机要求。例如,"目的地不可达"ICMP 消息被发送到其 IP 数据包无法传送的主机。当违规数据包是由路由器生成时,该路由器必须按照表 25-2 中的主机要求处理 ICMP 错误消息。请注意,路由器不可能成为针对并非由它生成的 IP 数据包而发送的"目标不可达"消息的目标(这解释了为什么表 25-2 没有规定路由器收到此类消息时必须如何处理)。

A router must respect the host requirements when it is the originator of the IP packet that triggered the transmission of an ICMP message. For example, the Destination Unreachable ICMP message is sent to the host whose IP packet could not be delivered. When an offending packet is generated by a router, the router must process the ICMP error message according to the host requirements in Table 25-2. Note that a router cannot be the target of a Destination Unreachable message sent for an IP packet it has not generated (which explains why Table 25-2 does not specify how a router must behave when it receives one).

Similar comments apply to other message types.

ICMP_ECHO and ICMP_ECHOREPLY

These are probably the most common and best-known ICMP types. They are used by different applications, the most famous of which is ping.

The ICMP_ECHO message type is used to test the reachability of a remote host. When a host receives an ICMP_ECHO message, it replies with an ICMP_ECHOREPLY message. See the section "ping."

ICMP_DEST_UNREACH

When an IP packet cannot be delivered to its destination, or when the IP payload cannot be delivered to the target application on the remote host, this ICMP type is used to notify the sender about the failed delivery and its cause. This ICMP type has quite a few different subtypes (code values), all listed in Table 25-3. Not all of them are used by Linux.

The header used for this message includes the 32-bit field shown in Figure 25-1(a).

Table 25-3. ICMP codes for ICMP type ICMP_UNREACH

Code | Kernel symbol       | Description
0    | ICMP_NET_UNREACH    | Network unreachable.
1    | ICMP_HOST_UNREACH   | Host unreachable.
2    | ICMP_PROT_UNREACH   | Protocol unreachable. The transport protocol used on top of IP is not implemented on the target host.
3    | ICMP_PORT_UNREACH   | Port unreachable. There is no application listening to the port number specified by the destination port in the transport header.
4    | ICMP_FRAG_NEEDED    | Fragmentation needed. The IP packet needed to be fragmented but the Don't Fragment (DF) flag was set in the IP header.
5    | ICMP_SR_FAILED      | Source route failed.
6    | ICMP_NET_UNKNOWN    | Destination network unknown.
7    | ICMP_HOST_UNKNOWN   | Destination host unknown.
8    | ICMP_HOST_ISOLATED  | Source host isolated.
9    | ICMP_NET_ANO        | Communication with destination network is administratively prohibited.
10   | ICMP_HOST_ANO       | Communication with destination host is administratively prohibited.
11   | ICMP_NET_UNR_TOS    | Destination network unreachable for Type of Service.
12   | ICMP_HOST_UNR_TOS   | Destination host unreachable for Type of Service.
13   | ICMP_PKT_FILTERED   | Communication administratively prohibited.
14   | ICMP_PREC_VIOLATION | Host precedence violation.
15   | ICMP_PREC_CUTOFF    | Precedence cutoff in effect.

ICMP_SOURCE_QUENCH

This message type was originally defined as a mechanism for routers to inform peers about congestion. However, generating more traffic to help with congestion recovery did not turn out to be that effective, and RFC 1812 made this ICMP message type obsolete.

The original goal of this ICMP type (congestion control) is now taken care of by the Explicit Congestion Notification (ECN) mechanism described in RFC 3168.

ICMP_REDIRECT

ICMP REDIRECT message types are sent only by routers, and are processed by hosts and optionally by routers.[*] Linux provides a file in /proc that allows you to enable and disable the processing of ICMP_REDIRECT messages. Routers generate this type of message when they detect that a neighboring host is using suboptimal routing; that is, when a destination can be reached through a better gateway than the one generating the message.

The basic and most common cause for an ICMP_REDIRECT message is an ingress packet that needs to be forwarded out of the same device it was received from. We will see an example later in this section.

There are four subtypes for this ICMP message type, shown in Table 25-4. RFC 1812 states that only ICMP_REDIR_HOST and ICMP_REDIR_HOSTTOS should be generated because there are cases where the use of subnetting makes it harder to handle the other two ICMP codes. Linux follows this recommendation.

Table 25-4. ICMP codes for ICMP type ICMP_REDIRECT

Code | Kernel symbol                | Description
0    | ICMP_REDIR_NET (obsolete)    | Redirect for network address
1    | ICMP_REDIR_HOST              | Redirect for host address
2    | ICMP_REDIR_NETTOS (obsolete) | Redirect for network address and Type of Service
3    | ICMP_REDIR_HOSTTOS           | Redirect for host address and Type of Service

Figure 25-3 provides a scenario where a router generates an ICMP_REDIRECT message. From the topology it looks clear that Host X should use Router RT2 to reach Host Y. But suppose Host X has been configured only with the default gateway RT1 so that any traffic Host X sends outside its local network goes to Router RT1.

Figure 25-3. Example of ICMP_REDIRECT

This is what happens when Host X transmits an IP packet to Host Y:

  1. Host X sends Router RT1 a packet addressed to Host Y.

  2. Router RT1 consults its routing table and realizes the next hop is Router RT2. It also realizes that because Router RT2 is on the same subnet as Host X, Host X could have sent the packet directly to Router RT2.

  3. Router RT1 sends Host X an ICMP_REDIRECT message to inform it about the better route. Host X will save the route and use it next time.

  4. Router RT1 forwards the packet to Router RT2.

Normally, when a router detects that it is being asked to route a packet along a suboptimal route, it replies back to the sender with an ICMP_REDIRECT message that describes the correct route. For security reasons, however, these suggestions are often rejected nowadays: you can imagine how easy it otherwise could be to create trouble by just saying, "Look, to get to that network you should use Router XYZ rather than the one you have been configured with."

In the section "Transmitting ICMP_REDIRECT Messages" in Chapter 31, you can find the exact conditions that trigger the transmission of ICMP_REDIRECT messages. Also see the section "ICMP Redirect" in Chapter 20 for the interaction between this ICMP type and the Source Route IP option. In the section "Processing Ingress ICMP_REDIRECT Messages" in Chapter 31, you can find details about whether an ingress ICMP_REDIRECT message is processed.

ICMP_TIME_EXCEEDED

This message type has two subtypes, as shown in Table 25-5.

Table 25-5. ICMP codes for ICMP type ICMP_TIME_EXCEEDED

Code | Kernel symbol     | Description
0    | ICMP_EXC_TTL      | TTL exceeded
1    | ICMP_EXC_FRAGTIME | Fragment reassembly time exceeded

The IP header includes a field, TTL, that is decremented at each intermediate hop between source and destination. If the TTL becomes 0 before the packet reaches the destination host, the IP packet is dropped. The intermediate host that drops the packet sends an ICMP_EXC_TTL message to the sender to inform it that its packet was dropped. We will see in the section "traceroute" how the popular traceroute command uses it.

The ICMP_EXC_FRAGTIME message is generated when the defragmentation of an IP packet takes too long to complete and is therefore aborted.

ICMP_PARAMETERPROB

When a problem is found while processing the IP header of an ingress IP packet, the host that detects the problem sends an ICMP message of this type back to the source. The ICMP header (see Figure 25-1(c)) includes an offset that indicates where in the IP header the problem was found.

ICMP_TIMESTAMP and ICMP_TIMESTAMPREPLY

The ICMP_TIMESTAMP message type can be used to ask a remote host for a timestamp (actually two of them) and use it to synchronize the hosts' clocks. A host that receives an ICMP_TIMESTAMP message replies with an ICMP_TIMESTAMPREPLY message like that in Figure 25-4.

Figure 25-4. ICMP_TIMESTAMPREPLY structure

While the first timestamp is initialized by the ICMP_TIMESTAMP sender, the other two are initialized by the ICMP_TIMESTAMPREPLY sender. The second and third timestamps should reflect the time the ICMP_TIMESTAMP message was received and the time the associated ICMP_TIMESTAMPREPLY was transmitted.

These ICMP types are not of much use because other protocols are better suited for the same purpose (e.g., NTP).

ICMP_INFO_REQUEST and ICMP_INFO_REPLY

According to RFC 1122, these two ICMP message types were made obsolete because other protocols such as DHCP (and the older BOOTP and RARP) can do the same thing, and much more.

ICMP_ADDRESS and ICMP_ADDRESSREPLY

The purpose of these ICMP types is to allow a host to discover the netmasks to use on its interfaces by broadcasting a query on the attached networks. A router that receives an ICMP_ADDRESS message replies with an ICMP_ADDRESSREPLY message. The reply is usually unicast to the sender, but may be broadcast when the sender uses a source IP address of 0 (i.e., not configured).

The goal of these two message types is achieved nowadays by other means, such as DHCP.

The Linux kernel does not reply to ingress ICMP_ADDRESS messages, but it listens to ingress ICMP_ADDRESSREPLY messages to detect misconfigurations (such as wrong netmask configurations).

Among the reasons why Linux does not process ICMP_ADDRESS messages is that the same interface can be configured with multiple IP addresses, and therefore there may not be a unique netmask to return to any given request.

According to RFC 1812, the implementation of the ICMP_ADDRESS and ICMP_ADDRESSREPLY messages is mandatory on routers, so Linux is not compliant. However, since these ICMP message types are not commonly used, this missing compliance does not represent a compatibility problem in any way for Linux.

Applications of the ICMP Protocol

ICMP messages can be transmitted both by the kernel and by user-space applications. The user-space applications use the raw IP socket interface that we briefly introduced in Chapter 13. Two well-known examples of network troubleshooting tools that use the ICMP protocol are traceroute and ping. Other users of the raw IP socket interface for transmitting or listening to ICMP messages are routing protocols.

ping

ping does not need an introduction. For most people, it represents the first command learned when approaching the networking area. Given an input IP address and a set of optional flags, it transmits an ICMP_ECHO message to the input IP address, and prints the round-trip time and other information when it receives the associated ICMP_ECHOREPLY message. You can find the history of ping at http://ftp.arl.mil/~mike/ping.html, the home page of its creator.

traceroute

traceroute is probably the first command you learned after ping. It is used to determine the path between the host where the command is issued and a given destination IP address. The path is represented by the list of IP addresses of the intermediate routers.

traceroute can achieve its goal by using either UDP or ICMP.[*] By default, it uses UDP, but you can force the use of ICMP with the -I switch option. As we will see, the UDP method also depends on an ICMP message for its success. Both methods demonstrate considerable cleverness.

Let's see how the technique based on ICMP works. As we saw in Chapter 20, when the IP header's TTL field of an ingress IP packet is 1 and forwarding is required, the receiver discards the packet and sends back to the source an ICMP message of type ICMP_TIME_EXCEEDED and code ICMP_EXC_TTL. traceroute takes advantage of this rule to discover the intermediate hops one at a time: by sending ICMP_ECHO messages to the destination IP address with increasing values of the TTL field (starting with value 1), it makes sure that all intermediate hosts will generate ICMP_TIME_EXCEEDED messages, and the last one (i.e., the target host) will reply with an ICMP_ECHOREPLY message. Figure 25-5 shows an example.

Figure 25-5. Example of traceroute with ICMP

I did not include the value of the TTL field for the ICMP reply messages in the figure because different operating systems use different values (64 and 255 are the most common).

The technique based on the use of the UDP protocol is somewhat similar. It still takes advantage of how the TTL field is handled, but instead of using ICMP_ECHO messages, it uses UDP packets with a high destination port number that is unlikely to be used at the end host. When the IP packet makes it to the end host, the latter will complain with an ICMP message of type ICMP_DEST_UNREACH and code ICMP_PORT_UNREACH. Figure 25-6 shows an example.

Figure 25-6. Example of traceroute with UDP

In both cases, ICMP and UDP, the intermediate hosts are discovered one by one with independent "probe" packets. Two consequences of this are worth mentioning:

  • The round-trip times associated with the intermediate routers reflect the network's congestion state at different times. Therefore, the nth intermediate router will usually have a higher round-trip time than the (n-1)th intermediate router, but not always.

  • The intermediate routers used to reach the nth hop may not be the same ones used to reach the (n-1)th hop. Different factors can contribute to the selection of the route to take toward a given destination, such as dynamic routing changes, load balancers, etc. The source code of the traceroute command, which you can download from the download servers of the most common Linux distributions, includes a few examples worth reading.

The Big Picture

Figure 25-7 shows the kernel subsystems with which the ICMP protocol interacts. The figure shows only the two common transport protocols, TCP and UDP, but many others also interact with ICMP, such as the various tunnel protocols (IPIP, GRE), the protocols of the IPsec suite (AH, ESP, IPcomp), etc.

Figure 25-7. The big picture

Here are some examples of interactions between protocols:

IP protocol

The ip_local_deliver_finish routine, described in Chapter 24, delivers ingress ICMP messages to the receive routine icmp_rcv registered by the ICMP protocol, but it also delivers them to the raw IP sockets that registered against the ICMP protocol (raw_v4_input). Transmission requests are submitted to the IP layer via the ip_append_data and ip_push_pending_frames routines described in detail in Chapter 21. The figure does not show the points where the IP protocol or the routing subsystem call icmp_send.

Routing subsystem

ICMP messages are transmitted with icmp_reply and icmp_send. Both are described in the section "Transmitting ICMP Messages." These routines consult the routing table with the ip_route_output_key function described in Chapter 33. Also, the routines that process ingress ICMP messages may need to interact with the routing subsystem, such as by using ip_rt_redirect and ip_rt_frag_needed, to process the information received with an ICMP message.

Socket layer

When an ingress ICMP message carries an error indication, the socket layer is notified by invoking the err_handler function pointer registered by the transport protocol associated with the faulty IP packet.

Ingress ICMP messages are dispatched to the right handler based on the ICMP type.

Protocol Initialization

The ICMPv4 protocol is initialized with icmp_init in net/ipv4/icmp.c. The ICMP protocol cannot be compiled as a module, so there is no module_init or module_cleanup function. The meaning of the __init macro tagging icmp_init can be found in Chapter 7.

Initialization consists of the creation of an array of sockets, one per CPU, which will be used when transmitting ICMP messages generated by the kernel (as opposed to user-generated messages). Those sockets, of type SOCK_RAW and protocol IPPROTO_ICMP, are not to be inserted into the kernel socket's table because they are not supposed to be used as targets for ingress ICMP messages. For this reason, a call to the unhash function takes the sockets out of the hash tables where they have been added by the generic routine sock_create_kern.

    void __init icmp_init(struct net_proto_family *ops)
    {
        struct inet_sock *inet;
        int i;
        for (i = 0; i < NR_CPUS; i++) {
            ...
            err = sock_create_kern(PF_INET, SOCK_RAW, IPPROTO_ICMP,
                        &per_cpu(__icmp_socket, i));
            if (err < 0)
                panic("Failed to create the ICMP control socket.\n");
            ...
            inet = inet_sk(per_cpu(__icmp_socket, i)->sk);
            inet->uc_ttl = -1;
            inet->pmtudisc = IP_PMTUDISC_DONT;
            per_cpu(__icmp_socket, i)->sk->sk_prot->unhash(per_cpu(__icmp_socket, i)->sk);
        }

uc_ttl, the TTL value to use for IP packets sent to unicast addresses, is initialized to -1 to tell the kernel to use the default unicast TTL (sysctl_ip_default_ttl). The setting of IP_PMTUDISC_DONT disables PMTU discovery on the sockets.

The per-CPU sockets can be accessed with the icmp_socket macro defined in net/ipv4/icmp.c, which transparently selects the right socket based on the local CPU ID.

     static DEFINE_PER_CPU(struct socket *, __icmp_socket) = NULL;
     #define icmp_socket    __get_cpu_var(__icmp_socket)

Data Structures Featured in This Chapter

The three main data structures used by the ICMP code are:

icmphdr

ICMP header.

icmp_control

ICMP message type descriptor. Among its fields is the routine used to process ingress messages.

icmp_bxm

Input structure given as a parameter to the two transmit routines described in the section "Transmitting ICMP Messages." It includes all the information necessary to transmit an ICMP message.

icmphdr Structure

We saw in Figure 25-1 the structure of an ICMP message. The following, from include/linux/icmp.h, is the data structure used to define an ICMP header:

    struct icmphdr {
      __u8    type;
      __u8    code;
      __u16   checksum;
      union {
            struct {
                __u16   id;
                __u16   sequence;
            } echo;
            __u32   gateway;
            struct {
                __u16   __unused;
                __u16   mtu;
            } frag;
      } un;
    };

First come the three fields common to all ICMP types, and then a union that provides different fields based on the message type. For example, un.frag is used by ICMP_FRAG_NEEDED messages, and un.echo by the query messages (i.e., ICMP_ECHO, ICMP_ECHOREPLY, etc.).

icmp_control Structure

For each ICMP type there is an instance of an icmp_control data structure (defined in net/ipv4/icmp.c). Among other fields, it includes a pointer to the routine that is to be called to process ingress ICMP messages. Here are its fields:

int output_entry

int input_entry

Indexes used by the receive routine icmp_rcv and the transmission routines in the section "Transmitting ICMP Messages" to update the right SNMP counter in an array. See the section "ICMP Statistics."

void (*handler)(struct sk_buff *skb)

Function invoked by the receiving routine icmp_rcv to process incoming ICMP messages.

short error

Flag that is set when the ICMP type is classified as an error (as opposed to a query). See Table 25-1.

Here are two examples where the error field is useful, as mentioned in the section "Transmitting ICMP Error Messages":

  • The kernel can check to make sure it is not replying to an ingress ICMP error message with another ICMP error message, which is prohibited.

  • ICMP types that are classified as errors are given a better TOS (IPTOS_PREC_INTERNETCONTROL) since they are considered more important (see icmp_send [*]).

Refer to the section "Receiving ICMP Messages" to see how icmp_control data structures are organized.

icmp_bxm Structure

icmp_bxm is defined in net/ipv4/icmp.c. Here is a description of its fields:

struct sk_buff *skb

For ICMP messages sent with icmp_send, represents the ingress IP packet that triggered the transmission. For ICMP messages sent with icmp_reply, represents an ingress ICMP message request.

int offset

Offset between skb->data and skb->nh (i.e., the size of the IP header). This offset is useful when evaluating how much data can be put into the ICMP payload for those ICMP messages that require it (see the section "ICMP Payload").

int data_len

Size of the ICMP payload.

struct {

    struct icmphdr icmph;
    __u32 times[3];

} data

icmph is the header of the ICMP message to transmit. times is used by the ICMP_TIMESTAMPREPLY message type (see Figure 25-4).

int head_len

Size of the ICMP header.

struct ip_options replyopts

unsigned char optbuf

replyopts stores the IP options to use at the IP layer. It is initialized with ip_options_echo based on the IP options of skb. optbuf is an extension of replyopts that is accessed by ip_options_echo via the __data field of ip_options. See Chapter 19.

Transmitting ICMP Messages

The two classes of ICMP messages introduced in the section "ICMP Header," errors and queries, are transmitted using two different routines:

icmp_send

Used by the kernel to transmit ICMP error messages when specific conditions are detected.

icmp_reply

Used by the ICMP protocol to reply to ingress ICMP request messages that require a response.

Both routines receive an skb buffer in input. However, the one used as input to icmp_send represents the ingress IP packet that triggered the transmission of the ICMP message, whereas the one in input to icmp_reply represents an ingress ICMP request message that requires a response.

The code in net/ipv4/icmp.c processes incoming ICMP messages, and therefore always uses icmp_reply to transmit an ICMP message in response to another one received in input. Other kernel network subsystems (i.e., routing, IP, etc.) use icmp_send when they need to generate ICMP messages, as shown in Figure 25-8.

Figure 25-8. Subsystems using icmp_send/icmp_reply

In both cases:

  • ip_route_output_key is used to find the route to the destination (see Chapter 33).

  • The two routines ip_append_data and ip_push_pending_frames are used to request a transmission to the IP layer. These routines are described in Chapter 21.

  • ICMP messages generated in kernel space are rate limited (if the kernel has been configured to do it via /proc) with icmpv4_xrlim_allow (see the section "Rate Limiting").

  • Transmissions are serialized with a per-CPU spin lock through icmp_xmit_lock and icmp_xmit_unlock. The per-CPU spin locks are accessed via the per-CPU ICMP sockets (see the section "Protocol Initialization"). When the spin lock cannot be acquired because it is already held, transmission fails (but neither of the routines returns an error code).
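
The fail-silently trylock behavior in the last bullet can be sketched with a plain flag standing in for the per-CPU spin lock; the names and structure here are illustrative, not the kernel's:

```c
#include <stdatomic.h>

/* Illustrative stand-in for the per-CPU ICMP socket lock. */
static atomic_flag icmp_socket_lock = ATOMIC_FLAG_INIT;

/* Try to acquire the lock; on failure give up silently, as icmp_send
   and icmp_reply do (no error code is returned to the caller). */
static int icmp_try_transmit(void)
{
    if (atomic_flag_test_and_set(&icmp_socket_lock))
        return 0;               /* lock already held: transmission fails */
    /* ... build the ICMP message and hand it to the IP layer here ... */
    atomic_flag_clear(&icmp_socket_lock);
    return 1;                   /* message transmitted */
}
```

Dropping an ICMP message when the lock is contended is acceptable because ICMP is a best-effort protocol; no caller depends on the message actually going out.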

Tables 25-6, 25-7, and 25-8 show where the ICMP types in Table 25-1 are generated by the kernel. For those subsystems covered in this book, they also include references to the routines where the ICMP messages are generated.

Table 25-6. Network subsystems that generate ICMP messages

Type  Name                  Generated by
0     ICMP_ECHOREPLY        ICMP (icmp_echo)
3     ICMP_DEST_UNREACH     See Table 25-7
5     ICMP_REDIRECT         Routing (ip_rt_send_redirect)
11    ICMP_TIME_EXCEEDED    See Table 25-8
12    ICMP_PARAMETERPROB    IPv4 (ip_options_compile, ip_options_rcv_srr)
14    ICMP_TIMESTAMPREPLY   ICMP (icmp_timestamp)

Table 25-7. Network subsystems that generate variants of the ICMP_DEST_UNREACH message type

Code  Kernel symbol       Generated by
0     ICMP_NET_UNREACH    Routing (ip_error), Netfilter
1     ICMP_HOST_UNREACH   Routing (ip_error, ipv4_link_failure), Netfilter, GRE, IPIP
2     ICMP_PROT_UNREACH   IPv4 (ip_local_deliver_finish), Netfilter, GRE
3     ICMP_PORT_UNREACH   Netfilter, GRE, IPIP, UDP
4     ICMP_FRAG_NEEDED    IPv4 (ip_fragment), GRE, IPIP, Virtual Server
5     ICMP_SR_FAILED      IPv4 (ip_forward)
9     ICMP_NET_ANO        Netfilter
10    ICMP_HOST_ANO       Netfilter
13    ICMP_PKT_FILTERED   Routing (ip_error), Netfilter

Netfilter generates ICMP_DEST_UNREACH messages when it drops ingress IP packets according to the configuration applied, for instance, with iptables. The --reject-with option for the REJECT target allows the user to select which ICMP message type to use when rejecting ingress IP packets that match a given rule.

Tunneling protocols such as IPIP and GRE, defined in net/ipv4/ipip.c and net/ipv4/ip_gre.c, respectively, need to handle ICMP messages according to the rules in RFC 2003, section 4.

Table 25-8. Network subsystems that generate variants of the ICMP_TIME_EXCEEDED message type

Code  Kernel symbol       Generated by
0     ICMP_EXC_TTL        IPv4 (ip_forward)
1     ICMP_EXC_FRAGTIME   IPv4 (ip_expire)

Transmitting ICMP Error Messages

Figures 25-9(a) and 25-9(b) show the internals of icmp_send. Here are its input parameters:

skb_in

Input IP packet the error is associated with.

type

code

Type and code fields to use in the ICMP header.

info

Additional information: an MTU for ICMP_FRAG_NEEDED messages, a gateway address for ICMP_REDIRECT messages, and an offset for ICMP_PARAMETERPROB messages.

Figure 25-9a. icmp_send function

Figure 25-9b. icmp_send function

icmp_send starts with a few sanity checks to filter out illegal requests. The following conditions cause it to abort:

  • The IP datagram is received as broadcast or multicast. This case is detected by checking the RTCF_BROADCAST and RTCF_MULTICAST flags of the routing cache entry associated with skb_in.

  • The IP datagram is received encapsulated in a broadcast link layer frame. This case is detected by comparing the packet type skb_in->pkt_type against PACKET_HOST.

  • The IP datagram is a fragment, and it is not the first one of the original packet. This case can be detected by reading the offset field of the IP header (see Chapter 22).

  • The IP datagram carries an ICMP error message. You must not use an error message to reply to an error message.
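
The four abort conditions above can be collected into a small predicate. The struct below is an illustrative stand-in for what icmp_send reads from skb_in and its routing cache entry, not real kernel types:

```c
/* Simplified view of the fields icmp_send consults (names are
   illustrative; the flags mirror RTCF_BROADCAST/RTCF_MULTICAST
   and skb->pkt_type == PACKET_HOST). */
struct pkt_info {
    int rt_broadcast;        /* RTCF_BROADCAST set on the cache entry */
    int rt_multicast;        /* RTCF_MULTICAST set on the cache entry */
    int pkt_type_host;       /* frame was addressed to this host      */
    unsigned frag_offset;    /* fragment offset from the IP header    */
    int is_icmp_error;       /* payload already carries an ICMP error */
};

static int may_send_icmp_error(const struct pkt_info *p)
{
    if (p->rt_broadcast || p->rt_multicast)
        return 0;            /* no errors about broadcast/multicast   */
    if (!p->pkt_type_host)
        return 0;            /* broadcast link-layer frame            */
    if (p->frag_offset != 0)
        return 0;            /* not the first fragment                */
    if (p->is_icmp_error)
        return 0;            /* never answer an error with an error   */
    return 1;
}
```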

It is not the responsibility of the ICMP layer to initialize the IP header. However, a couple of IP header fields will be initialized by the IP layer according to the requirements of ICMP. In particular:

Source IP address

When the target of the ICMP message is not a locally configured IP address (i.e., RTCF_LOCAL), the source IP address to place in the encapsulating header is selected according to the sysctl_icmp_errors_use_inbound_ifaddr configuration (see the section "Tuning via /proc Filesystem").

Type of Service (TOS)

The TOS is copied from the TOS of skb_in. In addition, when the ICMP message is classified as an error (see Table 25-1), the precedence component of the TOS is initialized to IPTOS_PREC_INTERNETCONTROL (i.e., this message has higher precedence). See Chapter 18 for more information on TOS.

IP options

The IP options are copied and reversed from skb_in with ip_options_echo. See the section "IP Options" in Chapter 19.

Next, the function finds the route to the destination with ip_route_output_key, which is a cache lookup routine introduced in Chapter 33.

Note that, as shown in Figure 25-8, transmissions are rate limited with a token bucket algorithm via the icmpv4_xrlim_allow routine. When the ICMP message is not suppressed by the token bucket algorithm, the transmission ends with a call to icmp_push_reply, which ends up calling the two IP routines shown in Figure 25-8.

Replying to Ingress ICMP Messages

As mentioned in the section "ICMP Header," a subset of the ICMP message types comes in pairs: a request message and a response message. For one example, an ICMP_ECHOREPLY message is sent in answer to an ingress ICMP_ECHO message. The transmission of response messages is done as follows:

  1. The header of the response message is first copied from the ingress request ICMP message.

  2. The type field of the ICMP header is updated (for example, ICMP_ECHO is replaced with ICMP_ECHOREPLY).

  3. icmp_reply is called to complete the transmission (i.e., to compute the checksum on the ICMP header, find the route to the destination, fill in the IP header, etc.).
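
The three steps above can be sketched in a few lines of user-space C (not kernel code): copy the request, flip the type, and recompute the standard Internet checksum over the message.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Standard Internet (one's-complement) checksum over a message. */
static uint16_t inet_csum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    while (len > 1) { sum += ((uint32_t)p[0] << 8) | p[1]; p += 2; len -= 2; }
    if (len) sum += (uint32_t)p[0] << 8;
    while (sum >> 16) sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* The three steps from the text: copy the request, turn ICMP_ECHO (8)
   into ICMP_ECHOREPLY (0), clear and recompute the checksum field. */
static void build_echo_reply(const uint8_t *req, uint8_t *reply, size_t len)
{
    memcpy(reply, req, len);          /* step 1: copy the request        */
    reply[0] = 0;                     /* step 2: type = ICMP_ECHOREPLY   */
    reply[2] = reply[3] = 0;          /* step 3: recompute the checksum  */
    uint16_t c = inet_csum(reply, len);
    reply[2] = (uint8_t)(c >> 8);
    reply[3] = (uint8_t)(c & 0xff);
}
```

A receiver validates the result the usual way: summing the whole reply, checksum field included, yields zero.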

Rate Limiting

ICMP messages are rate limited in two places:

By the routing code

The routing code rate limits only the outgoing ICMP_DEST_UNREACH and ICMP_REDIRECT message types. See the section "Routing Failure" in Chapter 35 and the section "Egress ICMP REDIRECT Rate Limiting" in Chapter 33.

By the ICMP code

The ICMP code can rate limit all outgoing ICMP message types (with only the few exceptions listed later in this section), including the types that are also rate limited by the routing code.

The two types of rate limiting differ in an important way: the routing code rate limits ICMP messages per destination IP address, and the ICMP code rate limits per source IP address. This means that the types that are rate limited by both ICMP and the routing code are rate limited twice.

Let me clarify this point. The kernel keeps the rate-limiting information needed to apply the token bucket algorithm in the dst_entry entries of the routing cache. Each dst_entry instance is associated with a destination IP address (more details in Chapter 33). This alone tells us that rate limiting is applied on a per-IP-address basis, not on a per-ICMP-message-type basis, but let's see exactly how per-source and per-destination rate limiting differ:

  • When a kernel subsystem, such as the IPv4 protocol, processes an input IP packet that meets certain error conditions, it sends an ICMP error message back to the source of the ingress IP packet. The ICMP code consults the routing table, the routing lookup returns a cache entry, and the cache entry is used to store the rate limiting information. This cache entry is associated with the route from the local host to the source of the faulty IP packet—that is, to the source IP address of the faulty IP packet. This is called per-source IP address rate limiting.

  • When the routing code cannot route an ingress IP packet, it generates an ICMP_HOST_UNREACH message, whereas it generates an ICMP_REDIRECT message when the destination IP address of the ingress IP packet is better reached via another gateway. In both cases, the routing code adds an entry to the cache whose associated destination IP address is the destination IP address of the ingress IP packet. This is why this is called per-destination IP address rate limiting. Chapter 35 explains how such cache entries will be used by subsequent matching IP packets.

Implementation of Rate Limiting

Let's see now how the ICMP code applies its rate limiting. As shown in Figure 25-10, any time an ICMP message is transmitted and rate limiting is configured in the kernel, the icmpv4_xrlim_allow function is called to enforce rate limiting. Both the ICMP message types to rate limit (sysctl_icmp_ratemask) and the rate limit's rate (sysctl_icmp_ratelimit) can be configured via /proc (see the section "Tuning via /proc Filesystem").

Figure 25-10. icmpv4_xrlim_allow function

icmpv4_xrlim_allow does not apply any rate limiting in the following cases:

  • ICMP messages whose type is not known to the kernel (they could be important ones).

  • ICMP messages used by the PMTU protocol described in RFC 1191 (i.e., type ICMP_DEST_UNREACH and code ICMP_FRAG_NEEDED).[*] PMTU is briefly described in Chapter 18.

  • ICMPs sent out on the loopback device.

icmpv4_xrlim_allow is a wrapper for a more general-purpose function, xrlim_allow, which does the real job. It is called if, according to the sysctl_icmp_ratemask bitmap, the ICMP message is to be rate limited.
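
The mask test itself is a simple bit check, sketched below. The default mask value used in the assertions (0x1818, covering types 3, 4, 11, and 12) is an assumption about a common configuration, not something stated in the text:

```c
/* Bit N of the rate mask set means ICMP type N is subject to
   rate limiting (mirrors the sysctl_icmp_ratemask semantics). */
static int type_is_rate_limited(unsigned long ratemask, int type)
{
    return (int)((ratemask >> type) & 1UL);
}
```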

    #define XRLIM_BURST_FACTOR 6
    int xrlim_allow(struct dst_entry *dst, int timeout)
    {
        unsigned long now;
        int rc = 0;

        now = jiffies;
        dst->rate_tokens += now - dst->rate_last;
        dst->rate_last = now;
        if (dst->rate_tokens > XRLIM_BURST_FACTOR * timeout)
                dst->rate_tokens = XRLIM_BURST_FACTOR * timeout;
        if (dst->rate_tokens >= timeout) {
            dst->rate_tokens -= timeout;
            return 1;
        }
        return rc;
    }

xrlim_allow applies a simple token bucket algorithm. Whenever it is called, it updates the available dst->rate_tokens tokens (measured in jiffies), makes sure that the accumulated tokens are not more than a predefined maximum value (XRLIM_BURST_FACTOR), and allows the transmission of the ICMP message if the available tokens are sufficient. The input parameter timeout represents the rate to enforce, expressed in Hz (for example, 1*HZ would mean a rate limit of one ICMP message per second).
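
The algorithm is easy to exercise outside the kernel. This user-space sketch mirrors xrlim_allow's arithmetic, with jiffies replaced by an explicit now parameter (a simulation, not the kernel routine):

```c
/* Simplified stand-in for dst_entry's rate-limiting fields. */
struct bucket {
    unsigned long rate_tokens;   /* accumulated tokens, in "jiffies" */
    unsigned long rate_last;     /* time of the previous call        */
};

#define BURST_FACTOR 6           /* mirrors XRLIM_BURST_FACTOR */

static int bucket_allow(struct bucket *b, unsigned long now,
                        unsigned long timeout)
{
    b->rate_tokens += now - b->rate_last;
    b->rate_last = now;
    if (b->rate_tokens > BURST_FACTOR * timeout)
        b->rate_tokens = BURST_FACTOR * timeout;   /* cap the burst   */
    if (b->rate_tokens >= timeout) {
        b->rate_tokens -= timeout;                 /* spend one slot  */
        return 1;
    }
    return 0;
}
```

With timeout equal to one notional second (100 ticks here) and a full bucket, exactly six back-to-back messages pass before the limiter kicks in, and one more token's worth accumulates after a full interval elapses.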

Note that since xrlim_allow is a generic routine shared by different protocols, it operates on protocol-independent routing cache entries (dst_entry structures), and icmpv4_xrlim_allow is an IPv4 routine and therefore operates on rtable data structures. For more details on the dst_entry and rtable data structures, please refer to Chapter 36.

Receiving ICMP Messages

icmp_rcv is the function called by ip_local_deliver_finish to process ingress ICMP messages.

The ICMP protocol registers its receiving routine icmp_rcv in net/ipv4/protocol.c, as described in Chapter 24. See Chapter 20 for more details on local delivery of ingress IP packets.

First, the ICMP message's checksum is verified. Note that even when the receiving NIC is able to compute the L4 checksum in hardware (which would be the ICMP checksum in this case) and that checksum says the ICMP message is corrupted, icmp_rcv verifies the checksum once more in software. You can refer to the section "sk_buff structure" in Chapter 19 for more details on L4 checksumming support by NICs.

Not all ICMP message types can be sent to a multicast IP address: only ICMP_ECHO, ICMP_TIMESTAMP, ICMP_ADDRESS, and ICMP_ADDRESSREPLY. icmp_rcv filters out those messages that do not respect this rule. In particular, ingress broadcast ICMP_ECHO messages are dropped if the system has been configured to do so. See the section "Tuning via /proc Filesystem."
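
That filter is a straightforward whitelist on the type field; a standalone sketch (the numeric values are the standard ICMP type assignments, the function name is illustrative):

```c
/* ICMP types that may legally be addressed to a multicast address. */
enum {
    ICMP_ECHO_T = 8, ICMP_TIMESTAMP_T = 13,
    ICMP_ADDRESS_T = 17, ICMP_ADDRESSREPLY_T = 18
};

static int allowed_on_multicast(int type)
{
    switch (type) {
    case ICMP_ECHO_T:
    case ICMP_TIMESTAMP_T:
    case ICMP_ADDRESS_T:
    case ICMP_ADDRESSREPLY_T:
        return 1;
    default:
        return 0;    /* everything else is dropped by icmp_rcv */
    }
}
```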

When all sanity checks are satisfied, icmp_rcv passes the ingress ICMP message to the right helper routine. The latter is accessed via the icmp_pointers vector that is initialized at the end of net/ipv4/icmp.c. icmp_pointers is an array of icmp_control data structures. Table 25-9 summarizes part of icmp_pointers's initialization. See the section "icmp_control Structure" for the exact meaning of the handler and error fields. Any types not in the table are obsolete, unsupported, or not supposed to be processed in kernel space. For all these types, handler is initialized to icmp_discard.

Table 25-9. Initialization of handler and error

Type  Kernel symbol        Handler             Error
3     ICMP_DEST_UNREACH    icmp_unreach        1
4     ICMP_SOURCE_QUENCH   icmp_unreach        1
5     ICMP_REDIRECT        icmp_redirect       1
8     ICMP_ECHO            icmp_echo           0
11    ICMP_TIME_EXCEEDED   icmp_unreach        1
12    ICMP_PARAMETERPROB   icmp_unreach        1
13    ICMP_TIMESTAMP       icmp_timestamp      0
17    ICMP_ADDRESS         icmp_address        0
18    ICMP_ADDRESSREPLY    icmp_address_reply  0

Figure 25-11 shows the internals of icmp_rcv.

Note that neither ICMP_ADDRESS nor ICMP_ADDRESSREPLY is supported; the two handlers that are registered against them are just placeholders or apply some kind of logging.

Figure 25-11. icmp_rcv function

Note also that the icmp_unreach handler takes care of different ICMP message types, not just ICMP_DEST_UNREACH.

Figure 25-12(a) shows how some of skb's pointers are initialized when icmp_rcv is invoked, and Figure 25-12(b) shows how they are initialized when the handlers of Table 25-9 are called. This figure can be useful when analyzing the routines in Table 25-9, especially icmp_unreach.

Figure 25-12. (a) skb at the beginning of icmp_rcv; (b) skb as it is passed to the handler

Processing ICMP_ECHO and ICMP_ECHOREPLY Messages

ICMP_ECHO messages are processed according to the generic model described in the section "Replying to Ingress ICMP Messages":

    static void icmp_echo(struct sk_buff *skb)
    {
        if (!sysctl_icmp_echo_ignore_all) {
            struct icmp_bxm icmp_param;

            icmp_param.data.icmph       = *skb->h.icmph;
            icmp_param.data.icmph.type  = ICMP_ECHOREPLY;
            icmp_param.skb              = skb;
            icmp_param.offset           = 0;
            icmp_param.data_len         = skb->len;
            icmp_param.head_len         = sizeof(struct icmphdr);
            icmp_reply(&icmp_param, skb);
        }
    }

ICMP_ECHOREPLY messages are not processed by the kernel, but by the applications that generated the associated ICMP_ECHO messages. See the section "Raw Sockets and Raw IP" in Chapter 24 for an example involving ping.

Processing the Common ICMP Messages

icmp_unreach is used as a handler for multiple ICMP types, as shown in Table 25-9. The function starts with some common sanity checks, continues with some processing based on the particular message type, and concludes with another common part.

The internals of the routine are shown in Figure 25-13.

The per-type processing is minimal:

  • It prints a warning message for ICMP_SR_FAILED ICMPs.

  • It updates the routing cache when it receives an ICMP of type ICMP_DEST_UNREACH and code ICMP_FRAG_NEEDED. The cache is updated with ip_rt_frag_needed, but only if PMTU discovery is enabled (i.e., if ipv4_config.no_pmtu_disc is zero). When PMTU discovery is not enabled, the kernel simply logs a warning.

  • It extracts the pointer field from the ICMP header when the message is of type ICMP_PARAMETERPROB. pointer is an offset relative to the beginning of the IP header in the ICMP payload. The field will be passed to the transport protocol.

  • ICMP_SOURCE_QUENCH does not require any specific treatment in icmp_unreach, so it is completely up to the transport protocols to handle it when notified via the err_handler routines. Currently, all transport protocols ignore this type of ICMP message.

For both ICMP_FRAG_NEEDED and ICMP_SR_FAILED, the logging is rate limited via LIMIT_NETDEBUG, which is a generic routine that rate limits networking-related messages to five per second.

The last part of icmp_unreach is again common to all ICMP types that use it as a handler, and consists of the following tasks:

  • When the sysctl_icmp_ignore_bogus_error_messages variable is set (by default, it is not), the ICMP message is discarded if it is received with a broadcast IP packet.

  • The function makes sure the ICMP payload includes the whole IP header of the IP packet that triggered the generation of the ICMP message, plus 64 bits from the transport payload of the same IP packet. This information is necessary to allow the transport protocol to identify a local socket (i.e., the application). When this condition is not met, the ICMP message is dropped. Note that the 64-bit requirement comes from RFC 792, but RFC 1812 changed the requirement (see the section "ICMP Payload").

    Figure 25-13. icmp_unreach function

  • The function notifies the transport protocol about this ICMP message via the err_handler function. The right transport protocol is identified using the protocol field of the IP header in the ICMP payload. See the section "Passing Error Notifications to the Transport Layer" and Figure 25-2.
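
The payload-length requirement described above (the whole IP header plus 64 bits of transport payload) reduces to a one-line check; ihl is the IP header length field in 32-bit words, and the names here are illustrative:

```c
#include <stddef.h>

/* Sketch of the RFC 792 payload check applied by icmp_unreach: the
   ICMP payload must carry the offending packet's full IP header
   (ihl * 4 bytes) plus the first 64 bits (8 bytes) of its transport
   payload. */
static int payload_is_sufficient(size_t icmp_payload_len, unsigned ihl)
{
    if (ihl < 5)             /* a valid IP header is at least 20 bytes */
        return 0;
    return icmp_payload_len >= (size_t)ihl * 4 + 8;
}
```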

Processing ICMP_REDIRECT Messages

icmp_redirect, the function used to process incoming ICMP_REDIRECT messages, is a wrapper around ip_rt_redirect with some additional sanity checks. The logic used by the latter function is described in the section "Processing Ingress ICMP_REDIRECT Messages" in Chapter 31. ip_rt_redirect adds an entry to the routing cache with rt_intern_hash, which is described in Chapter 33. The route is initialized with the RTCF_REDIRECTED flag toggled on, to be distinguished from the other routes. For example, we will see in the section "Examples of eligible cache victims" in Chapter 30 how the routing code uses this information when it is forced to delete entries from the routing cache.

The system administrator can also influence when ICMP redirects are generated. Through the /proc filesystem, it is possible to specify for each interface whether to send and accept ICMP redirects (see the section "The /proc/sys/net/ipv4/conf Directory" in Chapter 36). Using the firewall capabilities, as well, the administrator can specify from whom to accept particular types of ICMP packets and therefore whose ICMP_REDIRECT messages to trust.

Processing ICMP_TIMESTAMP and ICMP_TIMESTAMPREPLY Messages

Ingress ICMP_TIMESTAMP messages are handled by replying with an ICMP_TIMESTAMPREPLY message, using the scheme discussed in the section "Replying to Ingress ICMP Messages." The second and third timestamps are not initialized according to the rules we saw in the section "ICMP_TIMESTAMP and ICMP_TIMESTAMPREPLY": they are initialized to the same timestamp with do_gettimeofday.

Note that head_len is initialized to include not only the default ICMP header length, but also the three 32-bit timestamps.

Processing ICMP_ADDRESS and ICMP_ADDRESSREPLY Messages

Because the Linux kernel does not generate ICMP_ADDRESS messages, ingress ICMP_ADDRESSREPLY messages cannot be answers to queries generated locally (not in kernel space, at least). However, when forwarding and logging of Martian addresses[*] are enabled on the ingress device, Linux listens to ICMP_ADDRESSREPLY messages with icmp_address_reply. The latter function checks whether the mask advertised with the message is correct with regard to the IP addresses configured on the receiving interface: if the receiving interface does not have any IP address configured on the same subnet of the source IP address used by the ICMP message sender (which also implies the exact same netmask), the kernel logs a warning.

The sanity check on the received reply is not done when the routing cache has the RTCF_DIRECTSRC flag set. This flag is set only when the destination address is reachable by the local host via a next hop that has local scope (i.e., that exists only internally to the Linux box).

ICMP Statistics

The ICMP protocol keeps the statistics defined in RFC 2011, storing them in icmp_mib data structures. The kernel maintains statistics on a per-CPU basis, and for each CPU it distinguishes between statistics updated in software interrupt context and those updated outside that context. In other words, for each counter there are two instances per CPU: one of those two instances is used by code running in software interrupt context and the other is used by code not running in software interrupt context. All of those icmp_mib instances are allocated by init_ipv4_mibs in net/ipv4/af_inet.c. icmp_statistics is a two-element array, whose first element represents the per-CPU array of icmp_mib instances used by code that runs in software interrupt context, and whose second element represents the other per-CPU array.

    static int _ _init init_ipv4_mibs(void)
    {
        ...
        icmp_statistics[0] = alloc_percpu(struct icmp_mib);
        icmp_statistics[1] = alloc_percpu(struct icmp_mib);
        ...
    }

The icmp_mib structure consists of an array of unsigned long members, one for each counter defined in RFC 2011 for the ICMP protocol:

    #define SNMP_MIB_DUMMY  _ _ICMP_MIB_MAX
    #define ICMP_MIB_MAX    (_ _ICMP_MIB_MAX + 1)
    struct icmp_mib {
        unsigned long mibs[ICMP_MIB_MAX];
    } _ _SNMP_MIB_ALIGN_ _;

The counters are identified via the enumeration list ICMP_MIB_XXX, defined in include/linux/snmp.h:

    enum
    {
        ICMP_MIB_NUM = 0,
        ICMP_MIB_INMSG,
        ...
        ICMP_MIB_OUTADDRMASKREPS,
        _ _ICMP_MIB_MAX
    }

Note that the size of the mibs array in icmp_mib is one unit bigger than the size of the ICMP_MIB_XXX enumeration list. The extra element is used to account for ICMP message types not recognized by the kernel.

At any time, when the kernel needs to update a given counter, it selects the right element of icmp_statistics based on the interrupt context, and then the right icmp_mib instance based on the current CPU. The kernel provides a set of macros in include/net/icmp.h that take only the counter identifier (i.e., ICMP_MIB_XXX) as input and transparently take care of the two selections just described:

ICMP_INC_STATS

This macro can be used both in and outside of software interrupt context.

ICMP_INC_STATS_BH

This macro can be used when the code that needs to update a counter always runs in software interrupt context.

ICMP_INC_STATS_USER

This macro can be used when the code that needs to update a counter never runs in software interrupt context.

The three macros are defined as wrappers around generic macros provided by the SNMP subsystem:

    #define ICMP_INC_STATS(field)      SNMP_INC_STATS(icmp_statistics, field)
    #define ICMP_INC_STATS_BH(field)   SNMP_INC_STATS_BH(icmp_statistics, field)
    #define ICMP_INC_STATS_USER(field) SNMP_INC_STATS_USER(icmp_statistics, field)
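As a rough illustration of the two-instances-per-CPU scheme just described, here is a user-space sketch (not kernel code; the array sizes, counter count, and helper names are invented for the example). Index 0 of the outer array plays the role of the software-interrupt (BH) instances, and index 1 the process-context (USER) instances:

```c
#include <assert.h>

#define NR_CPUS  4    /* pretend there are four CPUs */
#define MIB_MAX  2    /* pretend there are only two counters */

/* [context][cpu][counter]: context 0 = software interrupt (BH),
 * context 1 = process context (USER). */
static unsigned long stats[2][NR_CPUS][MIB_MAX];

/* Stand-in for the kernel's smp_processor_id(). */
static int current_cpu;

static void inc_stats_bh(int field)   { stats[0][current_cpu][field]++; }
static void inc_stats_user(int field) { stats[1][current_cpu][field]++; }

/* Reading a counter folds both contexts over all CPUs, which mirrors
 * how the kernel folds per-CPU values when exporting them
 * (e.g., to /proc/net/snmp). */
static unsigned long read_stats(int field)
{
    unsigned long sum = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += stats[0][cpu][field] + stats[1][cpu][field];
    return sum;
}
```

The point of the split is that a non-atomic increment in process context cannot be corrupted by a software interrupt on the same CPU, since the two contexts touch different counter instances.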

Here is the meaning of the ICMP_MIB_XXX values. For a more detailed description, you can refer to RFC 2011.

Fields related to received ICMP messages

ICMP_MIB_INMSGS

Number of received ICMP messages. It includes those messages that are accounted for by ICMP_MIB_INERRORS.

ICMP_MIB_INERRORS

Number of ICMP messages dropped because of some problem. icmp_rcv and the handlers in Table 25-9 drop ingress messages when they have a truncated ICMP header. The L4 layer err_handler function described in the section "Passing Error Notifications to the Transport Layer" drops ingress messages when they have a truncated ICMP payload.

ICMP_MIB_INXXX

Besides the two general-purpose counters just listed, there is one counter per ICMP message type: ICMP_MIB_INXXX counts the number of received ICMP messages of type XXX.

There is an ICMP_MIB_OUTXXX counterpart for each ICMP_MIB_INXXX counter:

ICMP_MIB_OUTMSGS

Number of transmitted ICMP messages.

ICMP_MIB_OUTERRORS

Number of faulty ICMP transmissions. Not used.

ICMP_MIB_OUTXXX

Besides the two general-purpose counters just listed, there is one counter per ICMP message type: ICMP_MIB_OUTXXX counts the number of transmitted ICMP messages of type XXX.

ICMP_MIB_INXXX counters are updated in icmp_rcv.

ICMP_MIB_OUTXXX counters are updated within icmp_reply and icmp_send by invoking icmp_out_count:

    static void icmp_out_count(int type)
    {
        if (type <= NR_ICMP_TYPES) {
            ICMP_INC_STATS(icmp_pointers[type].output_entry);
            ICMP_INC_STATS(ICMP_MIB_OUTMSGS);
        }
    }

In both cases, for any ICMP type t, the right counter to increment is identified by means of the input_entry and output_entry fields of the icmp_control data structure associated with t. The values of these counters are exported in the /proc/net/snmp file. You can also read them with netstat -s (and with SNMP agents, of course).

Passing Error Notifications to the Transport Layer

We saw in the section "L4 Protocol Registration" in Chapter 24 that when transport protocols register with the kernel, they provide an instance of an inet_protocol data structure. It includes one function pointer, err_handler, which is called by the ICMP protocol to propagate to the transport layer error notifications received with ingress ICMP messages. RFCs 1122 and 1256 specify, for hosts and routers, respectively, whether each ICMP message type should be propagated to the transport layer. All the error message types that require a notification to be sent to the transport layer are processed by icmp_unreach. At the end of that function, the transport layer is notified with err_handler.

When the transport layer processes the notification, it uses the icmp_err_convert array defined in net/ipv4/icmp.c to convert the ICMP_DEST_UNREACH code into an error code that is better understood by the socket layer (see udp_err in net/ipv4/udp.c for an example). The transport layer passes that error code to the socket associated with the error (which is identified thanks to the ICMP payload, as described in the section "ICMP Payload"). Raw IP sockets are notified as well, by means of raw_err. Table 25-10 shows the conversion that is applied by icmp_err_convert. Note that the err_handler routines registered by tunneling protocols such as IPIP and GRE may generate new ICMP messages (see ipip_err in net/ipv4/ipip.c for an example).
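To make the conversion concrete, here is a user-space sketch of the icmp_err_convert lookup. The entries reproduce Table 25-10; the struct layout and field names are illustrative rather than the exact kernel definitions (in particular, the errno field is renamed errno_val here to avoid clashing with the C library's errno macro):

```c
#include <assert.h>
#include <errno.h>

/* Illustrative version of the kernel's conversion entry. */
struct icmp_err {
    int errno_val;  /* errno handed to the socket layer */
    int fatal;      /* 1 = the error is considered fatal */
};

/* Indexed by the ICMP_DEST_UNREACH code (0..15), per Table 25-10. */
static const struct icmp_err icmp_err_convert[] = {
    [0]  = { ENETUNREACH,  0 },  /* ICMP_NET_UNREACH    */
    [1]  = { EHOSTUNREACH, 0 },  /* ICMP_HOST_UNREACH   */
    [2]  = { ENOPROTOOPT,  1 },  /* ICMP_PROT_UNREACH   */
    [3]  = { ECONNREFUSED, 1 },  /* ICMP_PORT_UNREACH   */
    [4]  = { EMSGSIZE,     0 },  /* ICMP_FRAG_NEEDED    */
    [5]  = { EOPNOTSUPP,   0 },  /* ICMP_SR_FAILED      */
    [6]  = { ENETUNREACH,  1 },  /* ICMP_NET_UNKNOWN    */
    [7]  = { EHOSTDOWN,    1 },  /* ICMP_HOST_UNKNOWN   */
    [8]  = { ENONET,       1 },  /* ICMP_HOST_ISOLATED  */
    [9]  = { ENETUNREACH,  1 },  /* ICMP_NET_ANO        */
    [10] = { EHOSTUNREACH, 1 },  /* ICMP_HOST_ANO       */
    [11] = { ENETUNREACH,  0 },  /* ICMP_NET_UNR_TOS    */
    [12] = { EHOSTUNREACH, 0 },  /* ICMP_HOST_UNR_TOS   */
    [13] = { EHOSTUNREACH, 1 },  /* ICMP_PKT_FILTERED   */
    [14] = { EHOSTUNREACH, 1 },  /* ICMP_PREC_VIOLATION */
    [15] = { EHOSTUNREACH, 1 },  /* ICMP_PREC_CUTOFF    */
};
```

A transport protocol's err_handler would index this array with the code field of an ICMP_DEST_UNREACH message and pass errno_val (and the fatal flag) on to the socket layer.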

Table 25-10. Initialization of icmp_err_convert

Code   Kernel symbol         errno          Fatal (0=No, 1=Yes)
0      ICMP_NET_UNREACH      ENETUNREACH    0
1      ICMP_HOST_UNREACH     EHOSTUNREACH   0
2      ICMP_PROT_UNREACH     ENOPROTOOPT    1
3      ICMP_PORT_UNREACH     ECONNREFUSED   1
4      ICMP_FRAG_NEEDED      EMSGSIZE       0
5      ICMP_SR_FAILED        EOPNOTSUPP     0
6      ICMP_NET_UNKNOWN      ENETUNREACH    1
7      ICMP_HOST_UNKNOWN     EHOSTDOWN      1
8      ICMP_HOST_ISOLATED    ENONET         1
9      ICMP_NET_ANO          ENETUNREACH    1
10     ICMP_HOST_ANO         EHOSTUNREACH   1
11     ICMP_NET_UNR_TOS      ENETUNREACH    0
12     ICMP_HOST_UNR_TOS     EHOSTUNREACH   0
13     ICMP_PKT_FILTERED     EHOSTUNREACH   1
14     ICMP_PREC_VIOLATION   EHOSTUNREACH   1
15     ICMP_PREC_CUTOFF      EHOSTUNREACH   1

Tuning via /proc Filesystem

There are no compile-time kernel options for the ICMP protocol; all the tuning parameters are defined in net/ipv4/sysctl_net_ipv4.c and are exported via the /proc filesystem in the directory /proc/sys/net/ipv4:

icmp_echo_ignore_all

This flag is used by icmp_echo, the handler for incoming ICMP_ECHO ICMP messages, to decide whether to reply. This kind of filtering is usually done for security reasons by firewalls; however, the ICMP subsystem provides the capability, too.

icmp_echo_ignore_broadcasts

When this flag is set, ICMP_ECHO messages sent to broadcast addresses are ignored. See the section "Directed Broadcasts" in Chapter 30 for an example. The value of this field is checked in icmp_rcv.

icmp_ignore_bogus_error_responses

When this flag is clear, ICMP error message types with a broadcast destination IP address are ignored. icmp_unreach handles the flag.

icmp_errors_use_inbound_ifaddr

This flag is used to change how the source IP address is chosen when the local host transmits an ICMP error message. When the flag is not set, Linux selects the source IP address from the interface that is going to be used to transmit the ICMP message (see Part VII). When the flag is set, Linux selects the source IP address from the interface that received the IP packet that triggered the transmission of the ICMP message.

In most cases, the two interfaces match, but they could differ, for example, when two hosts are reachable with asymmetric routes (see the section "Essential Elements of Routing" in Chapter 30).

icmp_ratelimit

icmp_ratemask

These two variables are used by ICMP to rate limit outgoing ICMP messages (see the section "Rate Limiting"). sysctl_icmp_ratemask is simply a bitmap where each bit (starting from the least-significant bit) represents an ICMP type: if the bit corresponding to type XXX is set, outgoing ICMP messages of type XXX are rate limited.
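The bitmap test itself is a one-liner. The following user-space sketch (the macro values are the standard IPv4 ICMP type numbers; the helper name is invented for illustration) shows why the default mask 0x1818 selects types 3, 4, 11, and 12:

```c
#include <assert.h>

/* Standard IPv4 ICMP type numbers. */
#define ICMP_DEST_UNREACH   3
#define ICMP_SOURCE_QUENCH  4
#define ICMP_ECHO           8
#define ICMP_TIME_EXCEEDED  11
#define ICMP_PARAMETERPROB  12

/* Bit N of the mask set means ICMP type N is rate limited. */
static int rate_limited(unsigned long mask, int type)
{
    return (mask & (1UL << type)) != 0;
}
```

0x1818 in binary is 0001 1000 0001 1000, i.e., bits 3, 4, 11, and 12, which matches the footnote to Table 25-11.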

Table 25-11 summarizes the variables and associated files.

Table 25-11. /proc/sys/net/ipv4 files usable for tuning the ICMP subsystem

Kernel variable                            Filename                            Default value
sysctl_icmp_echo_ignore_all                icmp_echo_ignore_all                0
sysctl_icmp_echo_ignore_broadcasts         icmp_echo_ignore_broadcasts         0
sysctl_icmp_ignore_bogus_error_responses   icmp_ignore_bogus_error_responses   0
sysctl_icmp_errors_use_inbound_ifaddr      icmp_errors_use_inbound_ifaddr      0
sysctl_icmp_ratelimit                      icmp_ratelimit                      1 * HZ
sysctl_icmp_ratemask                       icmp_ratemask                       0x1818 [a]

[a] Given that each bit represents an ICMP type, and given the types in Table 25-1, this bitmap includes the following types: ICMP_DEST_UNREACH, ICMP_SOURCE_QUENCH, ICMP_TIME_EXCEEDED, and ICMP_PARAMETERPROB.

Functions and Variables Featured in This Chapter

Table 25-12 summarizes the main functions, variables, and data structures introduced in this chapter.

Table 25-12. Functions, variables, and data structures introduced in this chapter

Functions

icmp_init
    Initializes the ICMPv4 protocol. See the section "Protocol Initialization."

icmp_rcv
    Processes ingress ICMP messages. See the section "Receiving ICMP Messages."

icmp_send, icmp_reply
    Transmit an ICMP message. See the section "Transmitting ICMP Messages."

icmp_xmit_lock, icmp_xmit_unlock
    Get and release the per-CPU ICMP socket's transmit lock.

icmpv4_xrlim_allow, xrlim_allow
    Rate limit ICMP message transmissions. See the section "Rate Limiting."

icmp_out_count
    Updates SNMP counters for transmitted ICMP messages.

icmp_err_convert
    Converts ICMP error codes to socket error codes. See the section "Passing Error Notifications to the Transport Layer."

ICMP_INC_STATS, ICMP_INC_STATS_BH, ICMP_INC_STATS_USER
    Increment counters used to keep statistics on ICMP messages. See the section "ICMP Statistics."

Variables

icmp_statistics
    SNMP counters. See the section "ICMP Statistics."

Data structures

struct icmphdr, struct icmp_control, struct icmp_bxm
    Main data structures used by ICMPv4. See the section "Data Structures Featured in This Chapter."

icmp_mib
    Array of counters. See the section "ICMP Statistics."

Files and Directories Featured in This Chapter

The ICMP subsystem uses only five files—two for IPv4, two for IPv6, and one shared by the two IP versions—as shown in Figure 25-14.

Figure 25-14. Files and directories featured in this chapter




[*] See RFC 1812, sections 4.3.3.2 and 5.2.7.2.

[*] There is a third option, based on the use of an IP option (RFC 1393) that is not supported by Linux. The version of traceroute that comes with the most common Linux distributions does not support RFC 1393.

[*] This is required by RFC 1812 in section 4.3.2.5.

[*] Note that the policy used by the kernel has nothing to do with the one used by the firewall. It is common, for instance, for firewalls to drop all but a few ICMP messages. Sometimes the ones used by PMTU are dropped too, even though it goes against the RFC recommendations.

[*] See the definition of log_martians in the section "File descriptions" in Chapter 36.

Part VI. Neighboring Subsystem

Packets use a Layer three protocol such as IP to reach a LAN, and then a Layer two protocol such as Ethernet to go from the router on the local network to the system where the endpoint application is running. But a step is missing in this scenario. How do the router and the application host know who each other are? In more technical terms, how can a host find the L2 address (such as a MAC address) that corresponds to a given IP address? The action of finding the L2 address associated with a given L3 address is referred to as "resolving the L3 address." The missing piece is filled in by a neighboring protocol.

The most familiar neighboring protocol is Address Resolution Protocol (ARP), and Chapter 28 describes it in general terms. The corresponding protocol used in IPv6 is Neighbor Discovery (ND). But the key principles and tasks of a neighboring protocol, and a neighboring subsystem within an operating system, can be generalized.

Here is what each chapter discusses:

Chapter 26 Neighboring Subsystem: Concepts

Describes why and when a neighboring protocol is used and lays out its major tasks.

Chapter 27 Neighboring Subsystem: Infrastructure

Discusses the infrastructure that is common to all neighboring protocols.

Chapter 28 Neighboring Subsystem: Address Resolution Protocol (ARP)

Describes how ARP, the most common neighboring protocol and the one readers are most likely to have interacted with, uses the infrastructure.

Chapter 29 Neighboring Subsystem: Miscellaneous Topics

Covers the command-line and user-space interface (including the neighboring subsystem's directories in the /proc filesystem).

Chapter 26. Neighboring Subsystem: Concepts

This chapter describes why and when a neighboring protocol is used and lays out its major tasks. It is deliberately a general overview that makes only passing references to particular neighboring protocols such as ARP. It covers such general issues as:

  • The tasks taken on by a general neighboring infrastructure

  • Why caching is valuable

  • The states a neighbor entry in the cache can take

  • Reachability detection and Network Unreachability Detection (NUD)

  • What proxying is for

The terminology used in the Linux kernel source code follows the IPv6 neighbor discovery model described in RFC 2461 in the section "Neighboring Protocols," but we will try to keep the discussion as protocol-independent as possible.

The terms L2 address, Layer two address, hardware address, MAC address, and link layer address are commonly used to refer to the same concept. In this chapter, we will mostly use the first term.

What Is a Neighbor?

A host is your neighbor if it is connected to the same LAN (i.e., you are directly connected to it through either a shared medium or a point-to-point link) and it is configured on the same L3 network. For example, on an IP network, you can say that two hosts are neighbors if they are connected to the same LAN and each has at least one interface on the same IP subnet. Two such hosts can speak directly using the protocol associated with the medium that connects them (e.g., Ethernet). Another way to define a neighbor is to say that a host must be only one L3 hop away from its neighbor; its L3 routing table must provide a way for it to talk directly to the neighbor. Hosts that are not neighbors must communicate through a gateway or router.
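The "same IP subnet" test implied above reduces to comparing the two addresses under the netmask. Here is a minimal user-space sketch (host-order 32-bit addresses; the helper and macro names are invented for illustration), using the 10.0.1.0/24 example:

```c
#include <assert.h>
#include <stdint.h>

/* Two addresses lie on the same subnet when they match under the mask. */
static int same_subnet(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) == (b & mask);
}

/* Convenience macro to build a host-order IPv4 address. */
#define IP(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))
```

With a /24 mask, 10.0.1.1 and 10.0.1.2 pass the test (they are candidate neighbors), while 10.0.1.1 and 10.0.2.1 do not and must go through a router.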

Two hosts can still be neighbors if they are separated by a system on the L2 layer (a bridge). Part IV goes into more detail on this point, but we'll look at some simple examples here based on the IP networks of Figure 26-1.

Figure 26-1. Neighboring and non-neighboring hosts

Each topology in Figure 26-1 shows a different relationship between L3 and L2 addresses, which has implications for reaching neighbors:

Figure 26-1(a)

Host A and Host B belong to the same 10.0.1.0/24 IP subnet and therefore can talk directly, being just one L3 hop away from each other. They are neighbors.

Figure 26-1(b)

This shows a slightly more complex case. Host A and Host B still belong to the same subnet and can therefore talk directly to each other. Host A and Host C, on the other hand, belong to two different IP subnets; because of this they need to rely on a router (assuming they have been configured properly) to talk to each other. In this case, Host A and Host C can be considered two L3 hops away from each other.

Figure 26-1(c)

This shows a case of two hosts, connected to the same hub, that cannot talk to each other. Even if each host can receive whatever the other host transmits, they cannot talk to each other at the L3 layer because they have been configured with different IP subnets. Thus, Host A thinks it can reach only hosts within the subnet 10.0.1.0/24. It will not even try to send anything if the destination address is outside that subnet. This problem can be solved easily in numerous ways; we will see one in the following section, and more in other chapters.

Figure 26-1(d)

This shows a case where the subnet 10.0.1.0/24 actually consists of two LANs merged into one subnet through a hub or a bridge. We saw how they differ in Chapter 14, but from this chapter's perspective they can be considered equivalent. Note that the two interfaces used to merge the two LANs do not have IP addresses: this is because all three device types operate below the IP layer.

When two hosts are one L3 hop away from each other, they are usually one L2 hop away as well, as in Figure 26-1(a), (b), and (c). But this is not necessarily always the case, as shown in Figure 26-1(d), where Host A and the router are one L3 hop apart (and therefore are neighbors) but two L2 hops apart.

Furthermore, the relationship between physical subnets (LANs) and logical subnets (i.e., IP subnets) is not always one-to-one, as shown in Figure 26-2(a). You can have multiple IP subnets on one LAN, or multiple LANs on one IP subnet. For example, Figure 26-1(c) shows two IP subnets on the same LAN, and Figure 26-1(d) shows two LANs connected by a hub on the same IP subnet (on the left side). While the former is not common, the latter is commonly used when configuring Proxy ARP or bridging. You can see an example of Proxy ARP configuration in the section "Final Common Processing" in Chapter 28, and examples of bridging in Part IV.

Figure 26-2(b) shows two groups of hosts configured to lie on different IP subnets. Even if the hosts of the two groups share the same LAN and are therefore able to talk to each other directly, they have to go through the router, which listens on both sides. The router could have two different Network Interface Cards, or NICs (as shown in Figure 26-2(b)), or a single NIC with multiple IP configurations. This scenario is pretty uncommon: it could be used, for example, to address a temporary shortage of equipment or a failure. For example, if you had the scenario in Figure 26-2(a) and LAN1 failed, you could move LAN1's hosts to LAN2 (including the Router's eth0 interface[*]), and everything would work again without any need to change the IP subnet configuration of the hosts that were on LAN1. The hosts that were already on LAN2 will still access the other hosts through the router. Even if this scenario is uncommon, the kernel must be able to handle it properly. The implications of this scenario, especially when the router uses a single interface to access both subnets (i.e., eth0 is removed and its address is added to eth1), will be addressed in the section "Tunable ARP Options" in Chapter 28.

Figure 26-2. (a) IP_subnet-LAN 1:1; (b) IP_subnet-LAN n:1

In the rest of this chapter we will not explicitly mention the case of Figure 26-2(b), but you should keep in mind that setups like that one are possible and are not illegal.

Reasons That Neighboring Protocols Are Needed

In this section, we'll look at the basic reasons for the neighboring subsystem. They stem from the fundamental division of networks into layers, and the existence of shared media such as Ethernet.

When L3 Addresses Need to Be Translated to L2 Addresses

The reason for the distinction between the network Layer two (Ethernet, 802.11 wireless, Token Ring, point-to-point, etc.) and Layer three (IP or proprietary) protocols is that many different L2 protocols exist to take data between neighbors, whereas the routing L3 layer should not have to worry what medium is being used for transmission. The higher layer should be able to employ the same software to send packets between two systems whether they're on an Ethernet or a point-to-point connection.

Figure 26-3 shows the different situations that require different responses by the neighboring subsystem.

Figure 26-3. Point-to-point connection versus shared medium

Figure 26-3(a) shows a point-to-point connection, such as a dial-up line. The L2 protocol is fairly simple, handling such issues as error checking and taking turns if it's running on a half-duplex medium. The neighboring protocol is minimal, because it simply has to invoke the L2 protocol. There is no choice of which neighbor to send a packet to.

Figure 26-3(b) shows a more complicated situation: a host on an Ethernet or other shared medium that operates through broadcasts. If Host A has data for Host B, it must just place the data on the cable (or the radio waves, in the case of wireless) and let all systems on the shared medium receive it. It must indicate an L2 address so that one host knows the data is meant for it. Other hosts check the address and ignore the data. The neighboring protocol chooses the L2 address corresponding to the L3 address in the packet.

If Host A and Host B are separated by a bridge, the latter accepts the L2 address and directs it to the right host;[*] the neighboring subsystem doesn't have to worry about it. In fact, the bridge is invisible to the neighboring subsystem.

There is usually a one-to-one relationship between an L3 address and its corresponding L2 address. A system with multiple L3 addresses (usually a router) provides multiple interfaces so that the one-to-one relationship between L3 addresses and L2 addresses is preserved. But as the later section "Special Cases" explains, multiple multicast addresses at the L3 layer can map to the same L2 address. It is also possible for an interface to be configured with multiple IP addresses.
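The many-to-one multicast mapping is easy to see in code. For IPv4 over Ethernet, RFC 1112 maps a group address to a MAC address by copying its low 23 bits into the prefix 01:00:5e. The helper below is a user-space sketch of that mapping (the kernel has an equivalent helper, ip_eth_mc_map, in include/net/ip.h); because 5 bits of the group address are dropped, 32 distinct L3 multicast addresses share one L2 address:

```c
#include <assert.h>
#include <stdint.h>

/* Map a host-order IPv4 multicast group address to an Ethernet MAC,
 * per RFC 1112: 01:00:5e followed by the low 23 bits of the group. */
static void ip_mc_map(uint32_t group, uint8_t mac[6])
{
    mac[0] = 0x01;
    mac[1] = 0x00;
    mac[2] = 0x5e;
    mac[3] = (group >> 16) & 0x7f;  /* top bit of the low 24 is dropped */
    mac[4] = (group >> 8) & 0xff;
    mac[5] = group & 0xff;
}
```

For instance, the groups 224.10.1.1 and 225.138.1.1 differ in the discarded bits only, so both map to 01:00:5e:0a:01:01.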

Shared Medium

In a shared medium, any frame transmitted by one host is received by all the hosts directly connected to it. A simple example is a wireless link. Another common example is the shared coaxial cable used with Ethernet 10-base2.

For this reason, link layer protocols used in shared media need to define an addressing scheme so that a transmitter can specify the recipient of each frame, and the recipient can identify the sender. The addressing scheme usually also defines special addresses that can be used to address a frame to multiple hosts or to all of the hosts: the multicast and broadcast addresses.

Because multiple hosts may need to transmit and therefore use the shared medium at the same time, the link layer protocol must include a way to make sure all hosts connected to the medium detect this situation—called a collision—because the result is a corrupted frame. Ethernet uses the so-called Carrier Sense Multiple Access with Collision Detection protocol (CSMA/CD). We won't look at how collisions are handled because that is off-topic for this chapter. Information on all things Ethernet-related can be found in Ethernet: The Definitive Guide (O'Reilly).

On the other hand, point-to-point media, such as serial lines, are designed for communication between two endpoints only. In this case, there is no need to use a link layer address to identify the source and destination endpoints. The two endpoints can communicate in either half duplex or full duplex, depending on whether they share the same wire or have one each. In either case, there is no need for a collision detection mechanism: the two endpoints are either assigned one wire each (full duplex) or have a mechanism that each end can use to take ownership of the shared wire. As a consequence, there is no need for a neighboring protocol when two hosts are connected through a point-to-point medium.

Ethernet was first designed to work with a shared medium, allowing hosts to share the same medium and rely on CSMA/CD to handle collisions. This was the shared coaxial cable era (i.e., 10Base-2). However, over time the use of shared coaxial cables has been replaced with the use of unshielded twisted pair (UTP) wire, or RJ-45 wire, for a variety of reasons. The latter allows Ethernet interfaces to be configured in both half-duplex and full-duplex mode, because the UTP cable includes enough wires to allow both ends to speak at the same time. Ethernet in full-duplex mode can be used only on point-to-point connections between two Ethernet interfaces. In such a case, each end of the connection is assigned one wire for transmission and one for reception, so there is no need for CSMA/CD.

Nowadays, Ethernet LANs are mainly implemented with switches:[*] you connect each host to a switch with a UTP cable. In these scenarios, you can either configure the interfaces in half-duplex mode, in which case CSMA/CD is used to handle collisions between the switch port and the host's Ethernet adapter, or you can configure the two interfaces in full-duplex mode and allow both the host and the switch to transmit simultaneously. Both endpoints must use the same duplex configuration. In most cases, there is no need to explicitly configure the duplex mode on the two ends of the connection, because a duplex detection mechanism takes care of it.

Note that the frames generated by the hosts are never addressed to the switch (although there are exceptions to this general rule); the switch is used by a host to reach the other hosts connected to the same switch. Therefore, even though you do not need CSMA/CD when the interfaces are in full-duplex mode, you still need the source and destination addresses, and therefore a neighboring protocol. This also means that the multicast and broadcast capabilities that were provided by a really shared medium, such as the coaxial cable, are now provided by the switch by other means: when the switch receives a frame addressed to a multicast or broadcast link layer address, it copies it to all ports except for the one from which the frame is received. We saw in Part IV that switches are actually smarter than this.

Given that modern LANs are mainly implemented with Ethernet switches, and hosts are connected to switches with point-to-point links (UTP), the use of CSMA/CD has become of secondary importance in the design of newer Ethernet standards. Also for this reason (among others), newer Ethernet standards designed for higher speeds made the use of CSMA/CD optional or removed it altogether.

Table 26-1 indicates which flavors of Ethernet support CSMA/CD. Note that Gigabit Ethernet still supports CSMA/CD (shared), even though it is mainly used for full-duplex point-to-point connections. 10 Gigabit Ethernet, standardized mainly for use with WANs (as opposed to LANs), does not support CSMA/CD at all, and can be used for point-to-point links over fiber-optic media only. For each element of Table 26-1 there are actually many variants, but I did not include them because they are not needed for our discussion.

Table 26-1. Ethernet flavors and point-to-point/shared medium capabilities

Ethernet flavor              Point-to-point only    Shared (i.e., supports CSMA/CD)
Ethernet (10 Mbit/s)                                X
Fast Ethernet (100 Mbit/s)                          X
Gigabit Ethernet                                    X
10 Gigabit Ethernet          X

Why Static Assignment of Addresses Is Not Sufficient

We already saw in Chapter 13 the roles of L2 and L3 addresses and protocols. L3 addresses, such as IP addresses, are logical; this means that any valid address can be assigned to any interface. L2 addresses, on the other hand, are bound to NICs and are not supposed to be configurable: they are assigned to the interfaces by the vendors and are unique worldwide. However, most NICs can be configured to use arbitrary L2 addresses via common tools like ifconfig. This may be useful when dealing with local IEEE addresses, as described in Chapter 13. But when you change the L2 address of a NIC to a value that you do not own, you do it at your own risk: you are no longer assured that the address is unique, and the NIC therefore may not operate correctly on a shared medium where NICs are identified by their L2 addresses. Normally this is done by experienced administrators in special configurations, such as virtual servers or high-availability setups.

Because L3 addresses are logical, they can change for many reasons. Here are some common cases where an L3 address can change. These require the mapping between the L3 address and the associated L2 address to change as well.

Dynamic configuration

In IP networks, a host can be assigned a dynamic IP address by means of a protocol such as DHCP. The same host can be given a different IP address every time it asks for one, but the hardware address is hardcoded into the Ethernet or wireless card, so the L3-to-L2 mapping must be updated accordingly.

Replacement of a faulty interface

The L2 address changes once the NIC is replaced, but the administrator would probably prefer to keep the same logical configuration on the network, and therefore the same L3 address.

Moving an L3 address

A server may go down and require the same traffic to be handled by a different server; this means the old L3 address should be associated with a new server and a new interface. The change is required also if an administrator keeps the L3 address on the same host but uses a different interface.

To keep all of these changes isolated from both the L2 and L3 layers—because they have plenty of work to do without handling all the eventualities and caching involved—a protocol is needed to manage the association of L3 to L2 addresses. That is the neighboring protocol discussed in this part of the book.

Special Cases

Sometimes there is no need for any protocol to resolve the L3 address to an L2 address. These cases include the following:

  • There is only one host that data can be sent to on a point-to-point medium, such as a dial-up connection or a cable connecting a system temporarily to one that an administrator wants to monitor. Here, there is no addressing scheme at all at the L2 level. (However, even point-to-point media use L2 addresses in some contexts.)

  • There may be special L3 addresses whose associated L2 addresses can be obtained with a simple formula; because there is no ambiguity and no dynamic allocation, no protocol is needed.

  • Multicast addresses can be statically translated without any protocol. On IPv4/ARP networks, multicast addresses are resolved using the function arp_mc_map, which in turn invokes the very simple function ip_eth_mc_map when the device is an Ethernet NIC. The mapping in ip_eth_mc_map is done by a formula, without any protocol, as explained here and illustrated in Figure 26-4:

    • The most-significant 24 bits are assigned the static value 01:00:5E allocated by IANA.

    • Bit 23 (the most-significant bit of the lower 24) is set to 0.

    • The least-significant 23 bits are copied from the least-significant 23 bits of the IP address.

Note that the same Ethernet multicast address can be assigned to multiple IP addresses (because the most-significant 9 bits of the IP address are not used).

Figure 26-4. Generation of an Ethernet multicast address from an IPv4 multicast address

  • Broadcast addresses (IP subnet broadcasts) are statically resolved to the link layer broadcast address (FF:FF:FF:FF:FF:FF for Ethernet). The L2 broadcast address of each device can also be explicitly configured, if needed.

Solicitation Requests and Replies

When an L3-to-L2 mapping cannot be resolved through a static translation as described in the previous section, a neighboring protocol is needed to do the mapping. Different protocols may use different mechanisms. But for all of these protocols, it's useful to be familiar with the following terminology, which we'll use extensively in this part of the book:

Solicitation request (also called a neighbor solicitation)

This refers to the transmission of a packet on the network to ask all of the hosts whether any knows the L2 address associated with a given L3 address. This request can be sent as unicast, multicast, or broadcast, depending on both the protocol and the context.

Solicitation reply (also called a neighbor advertisement)

This is the packet that is normally sent in reply to a solicitation request. But it could also be generated independently (see the section "Gratuitous ARP" in Chapter 28 for an example). Under normal conditions, the host associated with the target L3 address generates the reply, but it is possible to have another host reply in its place (see the section "Proxying the Neighboring Protocol"). It is normally sent as unicast, but under specific conditions broadcasts are possible, too.

Linux Implementation

Early Linux kernels had L3 protocols call functions provided by neighboring protocols directly. The IPv4 subsystem, therefore, interacted directly with the ARP code. In recent versions of the kernel, developers have identified common requirements for different protocols and have abstracted them into a new layer called the neighboring infrastructure.

Because the kernel still includes old pieces of code that have not been updated to the new, protocol-independent layer, you can still find direct calls to a few deprecated functions of the ARP code (e.g., arp_find), but they are exceptions. The section "Common Interface Between L3 Protocols and Neighboring Protocols" in Chapter 27 discusses in detail the interface to the neighboring infrastructure.

Figure 26-5 shows the key parts of Linux's neighboring subsystems and the other parts of the kernel with which they interact. The L3 protocols interact with the neighboring layer via a common interface, which uses the right neighboring protocol (ARP, ND, etc.) depending on the L3 protocol that is asking for the service.[*]

When transmitting a packet, the following steps take place:

  1. The routing subsystem of the local host selects the L3 destination address (the next hop).

    Figure 26-5. The big picture

  2. If, according to the routing table, this hop is on the same network (that is, if the next hop is a neighbor), the neighboring layer resolves the destination's L3 address to its L2 address. This association is cached for future use. Thus, if one application sends several packets of data in a short amount of time to another application, the neighboring protocol is used only once, to send the first packet.

  3. Eventually, a function such as dev_queue_xmit (described in Chapter 11) takes care of the transmission, handing the packet to the Traffic Control or Quality of Service (QoS) layer. Although Figure 26-5 shows only dev_queue_xmit, the neighboring layer can actually invoke other functions as well (mostly wrappers around dev_queue_xmit), as we will see later in this chapter.

Note that dev_queue_xmit is called when the packet to transmit is ready to go, so if an L2 header is needed, the neighboring layer must add it before calling the function. Certain types of transmissions—point-to-point connections, broadcasts, and multicasts—do not require any L2 layer header and therefore do not need an L3-to-L2 mapping; these transmissions are covered in the section "Special Cases." Other transmissions use a shared medium and therefore need an L2 header, either from the neighboring subsystem's cache or through a request issued by the neighboring subsystem to the network.

Neighboring Protocols

Two protocols are in use in IP networks today. The vast majority of systems use ARP with IPv4. A more general-purpose protocol called Neighbor Discovery (ND) was developed for IPv6. Other neighboring protocols are also implemented in the Linux kernel for use with proprietary networks, such as the one used by DECnet, but we will not cover them in this book due to their limited use.

Although ARP is considered an L3 protocol, the task has been moved into L4 by the designers of IPv6. As shown in Figure 26-6, the ND protocol is considered a part of the IPv6 implementation of the Internet Control Message Protocol (ICMP). This choice was based on years of experience with IPv4. It provides ND with several advantages, among them the opportunity to take advantage of L3 features such as IPsec encryption. The section "Improvements in ND (IPv6) over ARP (IPv4)" in Chapter 28 gives an overview of the key differences between ND and ARP.

Figure 26-6. Positions of the ARP/ND protocols in the network stack

As mentioned, Linux also provides a common infrastructure to reduce overhead and code replication for services that are very similar across all neighboring protocols. The generic neighboring infrastructure provides services that can be tailored by different protocols to suit their needs. Here are some of the services provided by the infrastructure to the protocols:

  • A per-protocol cache to store the results of L3-to-L2 translations.

  • Functions to add, remove, change, and look up a specific translation entry in the cache. Because the lookup function influences the performance of the system most of all, it must be fast.

  • An aging mechanism for the entries in the per-protocol cache.

  • A choice of policies to follow when there is a request for a new translation entry to be created in the cache, and the cache is full.

  • A per-neighbor request queue. When a packet is ready to be sent and the L2 address is not already in the cache, the packet must be buffered until a solicitation request is sent and the reply is received. See the section "Queuing" in Chapter 27.

To let each protocol tailor the behavior of the neighboring subsystem, the subsystem defines a set of placeholder, or virtual, functions for which each protocol plugs in the functions it wants to use. This is similar to the way much of the Linux kernel allows customization. The neighboring layer also provides a number of tuning parameters that can be configured via user-space commands, /proc, or the protocol itself. Finally, the functions to access the cache are common to all of the protocols, but different protocols may use keys (addresses) of different sizes. Therefore, the infrastructure provides a generic way to define which type of key to use. Later chapters will cover all of these points in detail.

Each protocol can run and be configured independently from the others. The section "Protocol Initialization and Cleanup" in Chapter 27 shows how a neighboring protocol registers and unregisters itself with the kernel.

Proxying the Neighboring Protocol

When a host intercepts traffic addressed to another host and processes it in place of the latter, it is said to act as a proxy. The term does not, of course, cover a malicious host that launches a man-in-the-middle attack. Rather, a common example of a proxy is a caching HTTP server that cuts down on network traffic by intercepting requests directed to popular web servers and serving up pages from those web servers that are stored in its own cache.

If hosts and applications do not need to be explicitly configured to benefit from the services provided by a proxy, this proxy is said to be transparent. The caching HTTP server just mentioned is an example of a transparent proxy. But as Figure 26-7 shows, a service could be provided by either a transparent proxy or a nontransparent proxy. The figure shows two examples of an HTTP proxy in use:

  • (a) The proxy is installed on the router used by a local network to access the Internet. All browser requests from hosts on the network go through the router, so the administrator can configure the router to intercept and proxy all HTTP requests. This is considered transparent proxying because no configuration or specially programmed browser is needed on Host B.

  • (b) The browser of Host B is configured to use the proxy on the host named Proxy to browse the Internet. The host Proxy uses the router when it is needed (that is, when there is a cache miss).

Of course, several other options are possible. For instance, the proxy may be a separate machine, while the router is configured to relay HTTP requests to the proxy. I will not go into detail on this topic, because it is a large topic outside the context of this book. Proxies for neighboring protocols are normally transparent.

Figure 26-7. (a) Transparent proxy; (b) nontransparent proxy

The previous example showed one popular type of proxying: HTTP or web proxying. Now let's consider proxying in relation to this part of the book. A proxying server for a neighboring protocol is a host that is configured to reply to solicitation requests for addresses it does not own, in place of other hosts that actually have those addresses. Thanks to the proxy, hosts located on different LANs can talk to each other as if they were on the same LAN.

For instance, proxy ARP is commonly used in IPv4 networks to help in transitions from flat to subnetted networks. The hosts do not need special protocols or configuration because the proxy is transparent to them. But if the proxy server goes down, the connectivity to the hosts being proxied is lost, too. This can be mitigated by providing multiple proxy servers. In that case, a host may receive multiple solicitation replies to its (broadcast) requests. By selecting the first one, a host probably gets the fastest or least-loaded proxy server.

The use of proxies can also simplify the configuration of hosts taken care of by a proxy; one example is provided in the section "Proxy VS Router" in Chapter 28.

Among the neighboring protocols implemented in the Linux kernel, only IPv4 and IPv6 can use the proxy feature. The common infrastructure is shared by both protocols, each of which tailors proxying behavior to its needs. The differences are explained in the section "Improvements in ND (IPv6) over ARP (IPv4)" in Chapter 28.

In the section "Acting As a Proxy" in Chapter 27, we will see the implementation of the protocol-independent component of this feature in detail (timers, queues, etc.). In the section "Proxy ARP" in Chapter 28, we will see details on the specific case of IPv4 and ARP.

Conditions Required by the Proxy

Not all of the solicitation requests received by a proxy are processed. A proxy server replies to a solicitation request for an address if all of the following conditions are met:

  • The address does not belong to the same subnet as the one configured on the interface where the proxy received the request. Because a proxy server replies to solicitation requests in place of other hosts, these hosts must not reside on the same subnet as the sender of the solicitation request. Otherwise, the target host would respond as well as the proxy and it would not be clear which one the sender would choose.

  • The proxy feature is enabled. This stipulation may sound obvious, but it is not. Several criteria can determine whether proxying applies to a given request, and these differ across different neighboring protocols. Furthermore, the Linux kernel provides both a general and a more-specific form of proxying:

    Device based

    All valid requests received on the device are processed. This is the most common case in IPv4 networks. IPv6 does not use it.

    Destination based

    Both the destination address and the device are taken into account during the decision whether to proxy. This means that a proxy can reply to requests for selected IP addresses. Destination-based proxying is standard in IPv6 networks, but is available for IPv4, too.

Figure 26-8 shows the precedence between the two kinds of proxying. When a host receives a solicitation request for an address outside the local subnet, the host may process it if proxying is enabled. First the subsystem checks whether proxying is enabled globally on the device, and if not, whether the device is configured to proxy that particular address.

Figure 26-8. Priority between device and address proxying

  • Forwarding is enabled on the proxy server on which the request was received.

    Because the proxy server interposes itself between hosts, it has to accept forwarded traffic between the two endpoints.[*]

ARP solicitation requests are always sent to the L2 broadcast address. This ensures that all of the hosts sharing the same medium receive it. Thus, a proxy can intercept requests addressed to those hosts it proxies for without having to put any of its interfaces into promiscuous mode. When doing reachability confirmation (see the section "Reachability Confirmation"), ARP uses unicasts rather than broadcasts.

ND uses L3 multicast addresses to handle solicitation requests and replies. When a router wants to proxy a given IP address, it needs to subscribe to the associated L3 multicast address.

When Solicitation Requests Are Transmitted and Processed

In this section, we will see when a solicitation request is processed, based on the configuration of the receiving host and the physical topology of the network. Figure 26-9 covers the factors that lead a host to send out a solicitation request, and Figure 26-10 shows the most-common factors that determine whether a request is processed by the Linux host that receives it. To show the potential complexity of the recipient's decision, Figure 26-10 assumes that the recipient implements both proxying and bridging[*]; removing either of these features would simplify the flowchart. Figure 26-10 also assumes device-based proxying; destination-based proxying is similar, but leaves out a step. Note that Figure 26-10 shows both the case of a proxy server and the case of a common host that does not implement any proxying: "proxy enabled" denotes a proxy server, and "proxy disabled" denotes a common host.

Figure 26-9. Transmitting solicitation requests

This is a protocol-independent analysis; particulars about ARP are shown in Chapter 28.

Figure 26-10. Processing ingress solicitation requests

When bridging is enabled, solicitation requests are not processed by the receiving host, but are instead forwarded (bridged) to the right interfaces according to the bridging configuration. Bridging takes place before the neighboring protocol has a chance to look at the ingress packets. In other words, as the figure shows, bridging is handled before proxying in the Linux implementation of handling solicitation requests. See Part IV for details.

Let's suppose bridging is disabled. Keeping in mind that a host that sits on a shared medium can receive solicitation requests for addresses that belong to other hosts, here are the variables that can influence whether a Linux host replies to an ingress solicitation request:

Logical subnet (e.g., IP subnet)

"Same logical subnet" in Figure 26-10 is true when the solicited address and the L3 address configured on the NIC that receives the solicitation request belong to the same logical subnet (according to the configuration of the receiving host). If we take IPv4 as an example, 10.0.0.1 (as the solicited address) and 10.0.0.2/24 (as the address configured on the receiving NIC) would belong to the same 10.0.0.0/24 IP subnet.

When two hosts belong to the same logical subnet, they can talk directly. Otherwise, they need the help of a router.

Note that an interface may be configured with multiple addresses on the same logical subnet (one will be primary and the others secondary), with multiple addresses on different logical subnets, or a combination of these two. If the receiving NIC was configured with multiple addresses on different subnets, the solicited address must belong to one of those subnets.

Physical subnet (LAN)

When two hosts belong to the same LAN, they theoretically can talk directly, but whether they actually do so depends on the logical (L3) configuration. In Figure 26-1(c), for instance, hosts are on the same LAN but on different IP subnets.

A host does not try to resolve the address of another host that belongs to a different logical subnet; instead, it resolves the router's address because the router is the host it needs to talk to, to reach the remote host. See Figure 26-9.

Given this, a host will never (if we exclude corner cases and bugs) receive a solicitation request on an NIC for an L3 address known to reside on a different NIC, unless proxying is being used. Because Figure 26-10 shows the receiver's perspective, it does not distinguish between "Same physical subnet" and "Different physical subnet" under the "Different logical subnet" node because it would not make any difference: only the proxy status is important.

Proxy requirement

Not all of the solicitation requests received by a proxy are processed. See the section "Conditions Required by the Proxy" for details.

The section "Processing Ingress ARP Packets" in Chapter 28 shows how the various situations in Figure 26-10 are handled by the ARP protocol.

Neighbor States and Network Unreachability Detection (NUD)

图 26-11是内核在将数据包传输到给定 L3 地址时必须执行的步骤的简化摘要。

Figure 26-11 is a simplified summary of the steps the kernel has to go through when transmitting a packet to a given L3 address.

图26-12是一个简化模型,显示了邻居可以经历的状态。

Figure 26-12 is a simplified model that shows the states a neighbor can go through.

图 26-1126-12中的两个简单模型适用于大多数情况,但 Linux 内核使用更复杂的模型来处理所有可能的状态。下一节将扩展图26-12中的模型,后面的部分将重点讨论图26-11中的细节。

The two simple models in Figures 26-11 and 26-12 would work in most cases, but the Linux kernel uses a more sophisticated model to handle all possible states. The next section will expand the model in Figure 26-12, and later sections will focus on the details in Figure 26-11.

正如您所看到的，管理邻居的一个重要部分是了解它们是否可达。

As you can see, an important part of managing neighbors is to know whether they are reachable.

L3 到 L2 地址解析步骤

图 26-11。L3 到 L2 地址解析步骤

Figure 26-11. L3-to-L2 address resolution steps

L3 到 L2 映射的状态

图 26-12。L3 到 L2 映射的状态

Figure 26-12. States of an L3-to-L2 mapping

可达性

Reachability

从相邻子系统的角度来看,可达性可以通过现实生活中的类比来描述。假设你和其他人(包括我)一起在一个黑暗的房间里。如果你说“所有人都离开房间!” 每个人都会离开房间,因为他们都能听到你的声音。但如果你只想我一个人出去,你还需要一项信息:我的名字。

Reachability, from the neighboring subsystem's perspective, can be described through a real-life analogy. Suppose you are in a dark room with other people, including me. If you say "Everybody out of the room!" everybody will leave the room because they all can hear you. But if you want only me to go out, you will need one more piece of information: my name.

因此,发送到广播目标地址的请求回复所携带的信息量与使用单播目标地址的请求回复所携带的信息量不同:任何人都可以接收广播,但如果您想与给定的收件人通话,则需要确切的地址。

Thus, a solicitation reply sent to a broadcast destination address does not carry the same amount of information as one with a unicast destination address: anyone can receive a broadcast, but you need the exact address if you want to talk to a given recipient.

从邻居的角度来看,如果内核有证据表明接收者可以正确接收寻址到其单播地址的帧,则主机被认为是可达的,反之亦然。换句话说,内核需要双向可达性才能认为邻居可达。因此,在本章的其余部分,我们将使用术语“可达”来表示双向可达性。我们将在“可达性确认”一节中看到,有两种可能的方式来确认可达性:L4 确认和请求回复。

From the neighboring subsystem's perspective, a host is considered reachable if the kernel has proof that the recipient can correctly receive frames addressed to its unicast address, and vice versa. In other words, you need bidirectional reachability for the kernel to consider a neighbor reachable. In the rest of this chapter, we will therefore use the term reachable to mean bidirectional reachability. We will see in the section "Reachability Confirmation" that there are two possible ways in which reachability can be confirmed: L4 confirmation and a solicitation reply.

NUD 状态之间的转换

Transitions Between NUD States

IPv6 定义了一种 NUD 机制,可以帮助快速确定邻居是否已断开连接或已关闭。Linux 内核对 IPv4 和 IPv6 使用相同的机制。我们不会在本书中介绍的其他协议也使用类似的模型,例如 DECnet。

IPv6 defines an NUD mechanism that can help determine quickly whether neighbors have disconnected or gone down. The Linux kernel uses the same mechanism for both IPv4 and IPv6. Similar models are used by the other protocols we will not cover in the book, such as DECnet.

图26-13总结了邻居可以处于的状态以及可以触发状态改变的条件。条目可以由多个事件创建，包括向邻居发送数据包的请求，或者从邻居接收到请求（solicitation request）。

Figure 26-13 summarizes the states a neighbor can assume and the conditions that can trigger a change of state. An entry can be created by several events, including the request to transmit a data packet to a neighbor, or the reception of a solicitation request from a neighbor.

一个条目的状态在其生命周期内可能会改变多次，并且一个条目可以多次进入相同的状态。不同的协议可以执行不同的转换（包括一些图中未示出的转换），以利用特殊条件。例如，将新创建的条目直接置入 NUD_STALE 的转换由 IPv4 使用，但不由 IPv6 使用。

The state of an entry may change several times during its lifetime, and the same state can be entered multiple times by one entry. Different protocols may carry out different transitions, including some not shown in the figure, to take advantage of special conditions. For example, the link that puts a newly created entry directly into NUD_STALE is used by IPv4, but not by IPv6.

图 26-13中的状态描述如下。可能的值根据一些常见属性进行分组。该描述之后将讨论图中的转换,特别是 NUD 机制。

A description of the states in Figure 26-13 follows. The possible values are grouped based on some common properties. This description will be followed by a discussion of the transitions in the graph, and in particular the NUD mechanism.

基本状态

Basic states

图26-13中的状态 定义如下。我们从新创建条目的默认状态开始:

The states in Figure 26-13 are defined as follows. We start with the default state of a newly created entry:

NUD_NONE
NUD_NONE

邻居条目刚刚创建,还没有可用的状态。

NUD 状态之间的转变

图 26-13。NUD 状态之间的转变

The neighbor entry has just been created and no state is available yet.

Figure 26-13. Transitions among NUD states

下一组来自 IPv6 邻居定义,并已被最新的 Linux ARP/IPv4 实现所采用:

This next set comes from the IPv6 neighboring definition and has been adopted by the latest Linux ARP/IPv4 implementation as well:

NUD_INCOMPLETE
NUD_INCOMPLETE

请求已发出，但尚未收到回复。在这种状态下，没有可用的硬件地址（甚至不像 NUD_STALE 那样有旧地址可用）。

A solicitation has been sent, but no reply has been received yet. In this state, there is no hardware address to use (not even an old one, as there is with NUD_STALE).

NUD_REACHABLE
NUD_REACHABLE

邻居的地址被缓存并且已知后者是可达的(已经有可达性证明)。

The address of the neighbor is cached and the latter is known to be reachable (there has been a proof of reachability).

NUD_FAILED
NUD_FAILED

由于请求（solicitation request）失败（创建条目时生成的请求，或由 NUD_PROBE 状态触发的请求），将邻居标记为不可达。

Marks a neighbor as unreachable because of a failed solicitation request, either the one generated when the entry was created or the one triggered by the NUD_PROBE state.

NUD_STALE
NUD_STALE

NUD_DELAY
NUD_DELAY

NUD_PROBE
NUD_PROBE

过渡状态；当本地主机确定邻居是否可达时，它们将被解析。请参阅"可达性确认"一节。

Transitional states; they will be resolved when the local host determines whether the neighbor is reachable. See the section "Reachability Confirmation."

下一组值代表一组特殊状态,这些状态一旦分配通常就不会改变:

The next set of values represents a group of special states that usually never change once assigned:

NUD_NOARP
NUD_NOARP

此状态用于标记不需要任何协议来解析 L3 到 L2 映射的邻居(请参阅“特殊情况”部分)。第 28 章中的“ arp_constructor 函数的启动”部分 展示了在 IPv4/ARP 中如何以及为何设置此状态。但是,尽管该状态的名称表明它仅适用于 ARP,但它实际上可以被任何相邻协议使用。

This state is used to mark neighbors that do not need any protocol to resolve the L3-to-L2 mapping (see the section "Special Cases"). The section "Start of the arp_constructor Function" in Chapter 28 shows how and why this state is set in IPv4/ARP. But even though the name of this state suggests that it applies only to ARP, it can actually be used by any neighboring protocol.

NUD_PERMANENT
NUD_PERMANENT

邻居的 L2 地址已静态配置(即使用用户空间命令),因此无需使用任何邻居协议来处理它。请参阅第 29 章中的“邻居系统管理”部分。

The L2 address of the neighbor has been statically configured (i.e., with user-space commands) and therefore there is no need to use any neighboring protocol to take care of it. See the section "System Administration of Neighbors" in Chapter 29.

派生状态

Derived states

除了上一节列出的基本状态之外,还定义了以下派生值,只是为了在需要引用具有共同点的多个状态时使代码更清晰:

In addition to the basic states listed in the previous section, the following derived values are defined just to make the code clearer when there is a need to refer to multiple states with something in common:

NUD_VALID
NUD_VALID

如果一个条目的状态是以下任一状态，则认为该条目处于 NUD_VALID 状态，这些状态代表被认为具有可用地址的邻居：

NUD_PERMANENT
NUD_NOARP
NUD_REACHABLE
NUD_PROBE
NUD_STALE
NUD_DELAY

An entry is considered to be in the NUD_VALID state if its state is any one of the following, which represent neighbors believed to have an available address:

NUD_PERMANENT
NUD_NOARP
NUD_REACHABLE
NUD_PROBE
NUD_STALE
NUD_DELAY
NUD_CONNECTED
NUD_CONNECTED

它用于 NUD_VALID 状态中没有待定确认过程的子集：

NUD_PERMANENT
NUD_NOARP
NUD_REACHABLE

This is used for the subset of NUD_VALID states that do not have a confirmation process pending:

NUD_PERMANENT
NUD_NOARP
NUD_REACHABLE
NUD_IN_TIMER
NUD_IN_TIMER

相邻子系统正在为此条目运行计时器,当状态不清楚时会发生这种情况。与之对应的基本状态有:

NUD_INCOMPLETE
NUD_DELAY
NUD_PROBE

The neighboring subsystem is running a timer for this entry, which happens when the status is unclear. The basic states that correspond to this are:

NUD_INCOMPLETE
NUD_DELAY
NUD_PROBE

让我们看一个示例，说明为什么派生状态在内核代码中很有用。当邻居实例被删除时，主机需要停止与该数据结构关联的所有待处理定时器。与其将邻居的状态与已知具有待处理定时器的三个状态逐一比较，不如定义 NUD_IN_TIMER，并使用按位运算符 & 将邻居的状态与其比较，这样更简洁。

Let's look at an example of why a derived state is useful in kernel code. When a neighbor instance is removed, the host needs to stop all the pending timers associated with that data structure. Instead of comparing the neighbor's state to the three states known to have a pending timer associated with them, it is just cleaner to define NUD_IN_TIMER and compare the neighbor's state against it using the bitwise operator &.
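The bitmask trick above can be sketched in plain C. The numeric values below follow the one-bit-per-basic-state convention of include/net/neighbour.h; treat this as an illustrative sketch, not a copy of the kernel header.

```c
/* Sketch of the NUD state bitmasks; each basic state gets its own bit,
 * so derived states are simple ORs and membership tests are one AND. */
#define NUD_NONE       0x00
#define NUD_INCOMPLETE 0x01
#define NUD_REACHABLE  0x02
#define NUD_STALE      0x04
#define NUD_DELAY      0x08
#define NUD_PROBE      0x10
#define NUD_FAILED     0x20
#define NUD_NOARP      0x40
#define NUD_PERMANENT  0x80

/* Derived states: just ORs of the basic ones */
#define NUD_IN_TIMER  (NUD_INCOMPLETE | NUD_DELAY | NUD_PROBE)
#define NUD_VALID     (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | \
                       NUD_PROBE | NUD_STALE | NUD_DELAY)
#define NUD_CONNECTED (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)

/* One bitwise AND replaces three equality comparisons */
static int has_pending_timer(unsigned char nud_state)
{
    return (nud_state & NUD_IN_TIMER) != 0;
}
```

With this layout, the cleanup path described above needs no knowledge of which individual states carry a timer; it only tests the derived mask.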

初始状态

Initial state

当创建邻居实例时，默认为其分配 NUD_NONE 状态；但当创建由显式用户命令引起时，状态也可以被显式设置为不同的值（请参见第 29 章）。

When a neighbor instance is created, the NUD_NONE state is assigned to it by default, but the state can also be explicitly set to something different when the creation is caused by an explicit user command (see Chapter 29).

正如第 27 章"邻居初始化"一节所解释的，协议的 constructor 方法还可以根据关联设备（例如点对点）和 L3 地址（例如广播）的特性来更改状态。

As explained in the section "Neighbor Initialization" in Chapter 27, the protocol's constructor method may also change the state depending on the characteristics of the associated device (e.g., point-to-point) and L3 address (e.g., broadcast).

可达性确认

Reachability Confirmation

我们在“为什么地址静态分配不够”一节中看到,L3 到 L2 映射可能会发生变化。因此,如果信息有一段时间没有使用,定期确认缓存中存储的信息是有意义的。这称为可达性确认

We saw in the section "Why Static Assignment of Addresses Is Not Sufficient" that it is possible for an L3-to-L2 mapping to change. Because of this, it makes sense to confirm the information stored in the cache regularly, if the information has not been used for some time. This is called reachability confirmation.

请注意,可达性状态的变化不一定是由于“需要邻居协议的原因”一节中列出的原因造成的;路由器、网桥或其他网络设备可能刚刚遇到一些问题。当进行可达性确认时,缓存的信息将在假设它很可能仍然有效的情况下暂时使用。

Note that a change in reachability status is not necessarily due to the reasons listed in the section "Reasons That Neighboring Protocols Are Needed"; a router, bridge, or other network device may just be experiencing some problems. While the reachability confirmation is in progress, the cached information is temporarily used under the assumption that it is most likely still valid.

NUD_STALE、NUD_DELAY 和 NUD_PROBE 这三个 NUD 状态支持可达性确认任务。使用这些状态的关键原因是：在需要将数据包发送到相关邻居之前，无需启动可达性确认过程。

The three NUD states NUD_STALE, NUD_DELAY, and NUD_PROBE support the task of reachability confirmation. The key reason for the use of these states is that there is no need to start a reachability confirmation process until a packet needs to be sent to the associated neighbor.

让我们再次定义这三种 NUD 状态的确切含义,然后看看可以确认映射的两种方式:

Let's define once again the exact meaning of these three NUD states, and then look at the two ways a mapping can be confirmed:

NUD_STALE
NUD_STALE

缓存中包含邻居的地址，但后者已有一段时间未得到确认（参见第 29 章"neigh_parms 结构"一节中对 reachable_time 的讨论）。下次向该邻居发送数据包时，将启动可达性验证过程。

The cache contains the address of the neighbor, but the latter has not been confirmed for a certain amount of time (see the discussion of reachable_time in the section "neigh_parms Structure" in Chapter 29). The next time a packet is sent to the neighbor, the reachability verification process will be started.

NUD_DELAY
NUD_DELAY

该状态与 NUD_STALE 密切相关，代表一种可以减少请求（solicitation request）发送次数的优化。

当数据包发送到关联条目处于 NUD_STALE 状态的邻居时，就会进入该状态。NUD_DELAY 状态代表一个时间窗口，在此期间外部来源可以确认邻居的可达性。最简单的外部确认是相关邻居发送了一个数据包，从而表明它正在运行并且可以访问。

此状态为上层网络层提供了一些时间来提供可达性确认,这可以减轻内核发送请求请求的负担,从而节省带宽和 CPU 使用率。这种状态可能看起来像是一个小优化,但如果你从大网络的角度思考,你可以想象它可以提供的增益。

如果未收到确认，则该条目将进入下一个状态 NUD_PROBE，该状态通过显式请求或协议可能使用的任何其他机制来解析邻居的状态。

This state, closely tied to NUD_STALE, represents an optimization that can reduce the number of transmissions of solicitation requests.

This state is entered when a packet is sent to a neighbor whose associated entry is in the NUD_STALE state. The NUD_DELAY state represents a window of time where external sources could confirm the reachability of the neighbor. The simplest sort of external confirmation is when the neighbor in question sends a packet, thus indicating that it is running and accessible.

This state gives some time to the upper network layers to provide a reachability confirmation, which may relieve the kernel from sending a solicitation request and thus save both bandwidth and CPU usage. This state may look like a small optimization, but if you think in terms of big networks, you can imagine the gain it can provide.

If no confirmation is received, the entry is put into the next state, NUD_PROBE, which resolves the status of the neighbor through explicit solicitation requests or whatever other mechanism a protocol might use.

NUD_PROBE
NUD_PROBE

当邻居在 NUD_DELAY 状态停留了规定的时间且未收到可达性证明时，其状态将变为 NUD_PROBE，并开始请求过程。

When the neighbor has been in the NUD_DELAY state for the allotted amount of time and no proof of reachability has been received, its state is changed to NUD_PROBE and the solicitation process starts.

邻居的可达性状态可以通过两种主要方式来确认。正如我们将看到的，这两种方式的权威性并不相同。它们是：

The reachability status of a neighbor can be confirmed in two main ways. As we will see, these two methods do not have the same level of authority. They are:

单播请求回复的确认
Confirmation from a unicast solicitation's reply

当您的主机收到针对其先前发出的请求（solicitation request）的回复时，这意味着邻居收到了该请求并且能够发回回复；这反过来意味着它要么已经拥有您的 L2 地址，要么从您的请求中获知了您的地址（请参阅第 27 章中的"创建 neighbour 条目"一节）。这也意味着在两个方向上都有一条可用的路径。但请注意，仅当请求的答复作为单播数据包发送时，这才成立。收到广播答复会将状态移至 NUD_STALE 而不是 NUD_REACHABLE。（你可以在第 28 章的"处理入口 ARP 数据包"一节中从 ARP 的角度找到关于此问题的更多讨论。）

When your host receives a solicitation reply in answer to a solicitation request it previously sent out, it means that the neighbor received the request and was able to send back a reply; this in turn means that either it already had your L2 address or it learned your address from your request (see the section "Creating a neighbour Entry" in Chapter 27). It also means that there is a working path in both directions. Note, however, that this is true only when the solicitation's reply is sent as a unicast packet. The reception of a broadcast reply would move the state to NUD_STALE rather than NUD_REACHABLE. (You can find more discussion of this from the standpoint of ARP in the section "Processing Ingress ARP Packets" in Chapter 28.)

外部确认
External confirmation

如果您的主机确定它收到了来自邻居的数据包以响应先前发送的内容,则可以假设邻居仍然可以访问。图 26-14显示了一个示例,其中主机 A 的 TCP 层在收到回复其 SYN 的 SYN/ACK 时确认主机 B 的可达性。请注意,如果主机 B 不是主机 A 的邻居,则从主机 B 接收到的 SYN/ACK 将确认主机 A 用于到达主机 B 的下一跳网关的可达性。

外部邻居可达性确认示例

图 26-14。外部邻居可达性确认示例

确认是通过 dst_confirm 完成的，它确认用于将 SYN 数据包路由到主机 B 的路由表缓存条目的有效性。dst_confirm 是 neigh_confirm 的一个简单包装器，后者完成我们之前描述的任务：确认邻居的可达性，从而确认 L3 到 L2 的映射。注意，neigh_confirm 只更新 neigh->confirmed 时间戳；实际将邻居条目的状态升级为 NUD_REACHABLE 的是 neigh_periodic_timer 函数（在邻居进入 NUD_DELAY 状态时启动的计时器到期时执行）。[*]

请注意，图26-14中两个数据包之间的关联无法在 IP 层执行，因为后者不了解任何数据流。这就是由 L4 层负责确认的原因。TCP SYN/ACK 交换只是提供外部确认的 L4 协议的一个示例。给定一个套接字（以及关联的路由缓存条目及其下一跳网关），用户空间应用程序可以通过在 send 和 sendmsg 等传输调用中使用 MSG_CONFIRM 选项来确认网关的可达性。

虽然接收到请求回复可以将状态移至 NUD_REACHABLE 而不管当前状态如何，但只有当前状态为 NUD_STALE 时才能使用外部确认。这意味着如果条目刚刚创建并处于 NUD_INCOMPLETE 状态，则不允许外部确认来确认邻居的可达性（见图26-13）。

If your host is sure it received a packet from the neighbor in response to something previously sent, it can assume the neighbor is still reachable. Figure 26-14 shows an example, where the TCP layer of Host A confirms the reachability of Host B when it receives a SYN/ACK in reply to its SYN. Note that if Host B was not a neighbor of Host A, the reception of the SYN/ACK from Host B would confirm the reachability of the next hop gateway used by Host A to reach Host B.

Figure 26-14. Example of external neighbor reachability confirmation

Confirmation is done via dst_confirm, which confirms the validity of the routing table cache entry used to route the SYN packet toward Host B. dst_confirm is a simple wrapper around neigh_confirm, which accomplishes the task we described earlier: it confirms the reachability of the neighbor and therefore the L3-to-L2 mapping. Note that neigh_confirm only updates the neigh->confirmed timestamp; it will be the neigh_periodic_timer function (which is executed by the expiration of the timer started when the neighbor entered the NUD_DELAY state) that actually upgrades the neighbor entry's state to NUD_REACHABLE.[*]

Note that the correlation between the two packets in Figure 26-14 could not be performed at the IP layer because the latter doesn't have any knowledge of data streams. This is why the L4 layer takes care of the confirmation. TCP SYN/ACK exchanges are only one example of an L4 protocol providing external confirmation. Given a socket, and therefore the associated routing cache entry and its next-hop gateway, a user-space application can confirm the reachability of the gateway by using the MSG_CONFIRM option with transmission calls such as send and sendmsg.
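To make the user-space side concrete, here is a minimal sketch of passing MSG_CONFIRM on a transmission call. The destination port (9, the discard service) and the helper name are arbitrary choices for the example, and error handling is reduced to return codes.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Send one UDP datagram with MSG_CONFIRM, which tells the Linux
 * neighbouring subsystem that forward progress is being made toward
 * this destination (so it can refresh the next hop's neighbour entry
 * without issuing a new solicitation). Returns 0 on success. */
int send_probe_confirmed(const char *dest_ip)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9);                  /* discard port, illustrative */
    if (inet_pton(AF_INET, dest_ip, &dst.sin_addr) != 1) {
        close(fd);
        return -1;
    }

    /* MSG_CONFIRM is the Linux-specific flag discussed in the text */
    ssize_t n = sendto(fd, "ping", 4, MSG_CONFIRM,
                       (struct sockaddr *)&dst, sizeof(dst));
    close(fd);
    return n == 4 ? 0 : -1;
}
```

Note that MSG_CONFIRM only makes sense when the application has application-level evidence that the peer is alive; using it blindly defeats the purpose of NUD.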

While the reception of a solicitation's reply can move the state to NUD_REACHABLE regardless of the current state, external confirmations can be used only when the current state is NUD_STALE. This means that if the entry had just been created and it was in the NUD_INCOMPLETE state, external confirmations would not be allowed to confirm the reachability of the neighbor (see Figure 26-13).

请注意，如图26-13所示，NUD_DELAY/NUD_PROBE 和 NUD_NONE 都可以转换到 NUD_REACHABLE；然而，从 NUD_NONE 到 NUD_REACHABLE 需要完整的可达性证明，而从 NUD_DELAY/NUD_PROBE 出发，任何类型的确认都足够了。

Note that NUD_DELAY/NUD_PROBE and NUD_NONE can lead to NUD_REACHABLE, as shown in Figure 26-13; however, to get from NUD_NONE to NUD_REACHABLE, you need full proof of reachability, while from NUD_DELAY/NUD_PROBE, any kind of confirmation is sufficient.
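This asymmetry between the two kinds of confirmation can be captured in a short sketch. The state values follow the kernel's bit conventions, but the helper function and enum are hypothetical, for illustration only:

```c
/* Hypothetical helper expressing the rule above: a unicast solicitation
 * reply is full proof and can upgrade any state, while an external (L4)
 * confirmation is honored only when the entry is in NUD_STALE. */
#define NUD_INCOMPLETE 0x01
#define NUD_REACHABLE  0x02
#define NUD_STALE      0x04

enum confirm_kind { CONFIRM_SOLICIT_REPLY, CONFIRM_EXTERNAL };

static int can_mark_reachable(unsigned char nud_state, enum confirm_kind kind)
{
    if (kind == CONFIRM_SOLICIT_REPLY)
        return 1;                      /* full proof of reachability */
    return nud_state == NUD_STALE;     /* weaker proof: NUD_STALE only */
}
```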




[ * ]另一种方法是不移动上层路由器的接口,而只是将一个 IP 地址添加到下层接口。

[*] An alternative would be not to move the upper router's interface and simply add one IP address to the lower interface.

[ * ]我们在第四部分中看到了桥接器和交换机如何在可能的情况下将帧仅定向到正确的主机,从而减少向 LAN 中的每个主机发送无用的帧。

[*] We saw in Part IV how bridges and switches manage to direct frames only to the right host when possible, reducing in this way the useless delivery of frames to every host in the LAN.

[ * ]在本书中,网桥和交换机用于指代同一类型的设备。详细信息请参见第四部分。

[*] In this book, bridges and switches are used to refer to the same type of device. See Part IV for more details.

[ * ]该图不包括 DECnet 和 ATM,因为本书未涉及它们。

[*] The figure does not include DECnet and ATM, because they are not covered in this book.

[ * ]这并不意味着通过在代理主机上启用代理,您也会自动启用转发。这两个功能是单独配置的,但代理需要转发才能正常工作。

[*] This does not mean that by enabling proxying on the proxy host, you also automatically enable forwarding. The two features are configured separately, but proxying requires forwarding to function properly.

[ * ]我在图 26-10中添加了桥接,以表明桥接是在相邻协议之前处理的,因此后者可能并不总是看到入口请求。桥接在第四部分中有详细描述。

[*] I added bridging to Figure 26-10 to show that bridging is handled before the neighboring protocols, and therefore the latter may not always see ingress solicitation requests. Bridging is described in detail in Part IV.

[ * ]从 L4 层接收确认和将状态设置为 之间的延迟NUD_REACHABLE 不会以任何方式影响流量。

[*] The delay between the reception of the confirmation from the L4 layer and the setting of the state to NUD_REACHABLE does not affect traffic in any way.

第 27 章相邻子系统:基础设施

Chapter 27. Neighboring Subsystem: Infrastructure

第26章中,我们看到了邻居协议需要解决的主要问题。您还了解到 Linux 内核将部分解决方案抽象为通用基础设施 由各种相邻协议共享。在本章中,我们将了解基础设施是如何设计的。特别是,我们将看到协议如何与通用基础设施接口,如何实现缓存和代理,以及外部子系统(例如高层协议)如何通知邻近协议有关有趣的事件。我们将通过描述 IPv4 等 L3 协议如何实际与其相邻协议交互,以及如何为等待地址解析的缓冲区实现排队来结束本章。

In Chapter 26, we saw the main problems that the neighboring protocols are asked to solve. You also learned that the Linux kernel abstracted out parts of the solution into a common infrastructure shared by various neighboring protocols. In this chapter, we will see how the infrastructure is designed. In particular, we will see how protocols interface to the common infrastructure, how caching and proxying are implemented, and how external subsystems such as higher-layer protocols notify the neighboring protocols about interesting events. We will conclude the chapter with a description of how L3 protocols such as IPv4 actually interface with their neighboring protocols, and how queuing is implemented for buffers awaiting address resolution.

主要数据结构

Main Data Structures

为了理解邻近基础设施的代码,我们首先需要描述邻近子系统中大量使用的一些数据结构,并了解它们如何相互交互。

To understand the code for the neighboring infrastructure, we first need to describe a few data structures used heavily in the neighboring subsystem, and see how they interact with each other.

这些结构的大多数定义可以在文件include/net/neighbour.h中找到。请注意,Linux 内核代码使用英式拼写neighbour来表示与该子系统相关的数据结构和函数。当泛指邻居时,本书坚持使用美式拼写,即 RFC 和其他官方文件中的拼写。

Most of the definitions for these structures can be found in the file include/net/neighbour.h. Note that the Linux kernel code uses the British spelling neighbour for data structures and functions related to this subsystem. When speaking generically of neighbors, this book sticks to the American spelling, which is the spelling found in RFCs and other official documents.

struct neighbour
struct neighbour

存储有关邻居的信息，例如 L2 和 L3 地址、NUD 状态、可到达邻居的设备等。请注意，neighbour 条目不是与主机关联，而是与 L3 地址关联。一台主机可以有多个 L3 地址。例如，路由器等系统具有多个接口，因此具有多个 L3 地址。

Stores information about a neighbor, such as the L2 and L3 addresses, the NUD state, the device through which the neighbor can be reached, etc. Note that a neighbour entry is associated not with a host, but with an L3 address. There can be more than one L3 address for a host. For example, routers, among other systems, have multiple interfaces and therefore multiple L3 addresses.

struct neigh_table
struct neigh_table

描述相邻协议的参数和函数。每个相邻协议都有一个该结构的实例。所有实例都插入到由静态变量 neigh_tables 指向的全局列表中，并受 neigh_tbl_lock 锁保护。此锁保护列表的完整性，但不保护每个条目的内容。

Describes a neighboring protocol's parameters and functions. There is one instance of this structure for each neighboring protocol. All of the structures are inserted into a global list pointed to by the static variable neigh_tables and protected by the lock neigh_tbl_lock. This lock protects the integrity of the list, but not the content of each entry.

struct neigh_parms
struct neigh_parms

一组可用于在每个设备的基础上调整相邻协议行为的参数。由于大多数接口上可以启用多个协议（例如 IPv4 和 IPv6），因此一个 net_device 结构可以关联多个 neigh_parms 结构。

A set of parameters that can be used to tune the behavior of a neighboring protocol on a per-device basis. Since more than one protocol can be enabled on most interfaces (for instance, IPv4 and IPv6), more than one neigh_parms structure can be associated with a net_device structure.

struct neigh_ops
struct neigh_ops

代表 L3 协议（例如 IP）与 dev_queue_xmit（第 11 章介绍的 API，并在接下来的"L3 协议和相邻协议之间的通用接口"一节中简要描述）之间接口的一组函数。虚拟函数可以根据使用它们的上下文（即邻居的状态，如第 26 章中所述）而改变。

A set of functions that represents the interface between the L3 protocols such as IP and dev_queue_xmit, the API introduced in Chapter 11 and described briefly in the upcoming section "Common Interface Between L3 Protocols and Neighboring Protocols." The virtual functions can change based on the context in which they are used (that is, on the status of the neighbor, as described in Chapter 26).

struct hh_cache
struct hh_cache

缓存链路层标头以加快传输速度。将缓存的标头一次性复制到缓冲区中比逐一填充其字段要快。并非所有设备驱动程序都实现标头缓存。请参阅“ L2 标头缓存”部分。

Caches link layer headers to speed up transmission. It is faster to copy a cached header into a buffer in one shot than to fill in its fields one by one. Not all device drivers implement header caching. See the section "L2 Header Caching."

struct rtable
struct rtable

struct dst_entry
struct dst_entry

当主机需要路由数据包时，它首先查询其缓存，然后在缓存未命中的情况下查询路由表。主机每次查询路由表时，结果都会保存到缓存中。IPv4 路由缓存由 rtable 结构组成。每个实例都与不同的目标 IP 地址关联。该结构的字段包括目标地址、下一跳（路由器），以及一个用于存储与协议无关信息的 dst_entry 类型结构。dst_entry 包括一个指向与下一跳关联的 neighbour 结构的指针。我在第 36 章详细介绍 dst_entry 数据结构。在本章的其余部分中，我经常将 dst_entry 结构称为路由表缓存的元素，即使 dst_entry 实际上只是 rtable 结构的一个字段。

图 27-1显示了dst_entry结构如何链接到hh_cacheneighbour结构。

When a host needs to route a packet, it first consults its cache and then, in the case of a cache miss, it queries the routing table. Every time the host queries the routing table, the result is saved into the cache. The IPv4 routing cache is composed of rtable structures. Each instance is associated with a different destination IP address. Among the fields of the rtable structure are the destination address, the next hop (router), and a structure of type dst_entry that is used to store the protocol-independent information. dst_entry includes a pointer to the neighbour structure associated with the next hop. I cover the dst_entry data structure in detail in Chapter 36. In the rest of this chapter, I will often refer to dst_entry structures as elements of the routing table cache, even though dst_entry is actually only a field of the rtable structure.

Figure 27-1 shows how dst_entry structures are linked to hh_cache and neighbour structures.
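The pointer chain just described can be sketched with simplified structures. The names and layouts below are invented stand-ins for illustration; the real definitions live in the kernel headers (include/net/route.h, dst.h, neighbour.h) and carry many more fields.

```c
#include <stdint.h>

/* Simplified sketch: an rtable embeds a dst_entry, and the dst_entry
 * points at the neighbour structure used to reach the next hop. */
struct neighbour_s {
    unsigned char nud_state;       /* e.g., NUD_REACHABLE */
};

struct dst_entry_s {
    struct neighbour_s *neighbour; /* L3-to-L2 mapping of the next hop */
};

struct rtable_s {
    struct dst_entry_s dst;        /* protocol-independent part */
    uint32_t rt_dst;               /* destination IPv4 address */
    uint32_t rt_gateway;           /* next hop (router) address */
};

/* From a routing-cache hit to the next hop's NUD state in one step;
 * returns -1 when no neighbour is bound yet. */
static int next_hop_state(const struct rtable_s *rt)
{
    return rt->dst.neighbour ? rt->dst.neighbour->nud_state : -1;
}
```

This is why confirming a routing cache entry (dst_confirm) can be implemented as a thin wrapper over confirming the neighbour: the route already holds the pointer.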

邻近的代码还使用了一些其他的小型数据结构。例如，struct pneigh_entry 由基于目的地的代理使用，struct neigh_statistics 用于收集有关相邻协议的统计信息。第一个结构在"充当代理"一节中描述，第二个结构在第 29 章的"统计"一节中描述。图27-2还包括以下数据结构类型，第 22 章和第 23 章有更详细的描述：

The neighboring code also uses some other small data structures. For instance, struct pneigh_entry is used by destination-based proxying, and struct neigh_statistics is used to collect statistics about neighboring protocols. The first structure is described in the section "Acting As a Proxy," and the second one is described in the section "Statistics" in Chapter 29. Figure 27-2 also includes the following data structure types, described in greater detail in Chapters 22 and 23:

dst_entry、neighbour 和 hh_cache 结构之间的关系

图 27-1。dst_entry、neighbour 和 hh_cache 结构之间的关系

Figure 27-1. Relationship among dst_entry, neighbour, and hh_cache structures

in_device,inet6_dev
in_device, inet6_dev

分别用于存储设备的IPv4和IPv6配置。

Used to store the IPv4 and IPv6 configurations of a device, respectively.

net_device
net_device

内核识别的每个网络设备都有一个 net_device 结构。参见第 8 章。

There is one net_device structure for each network device recognized by the kernel. See Chapter 8.

图 27-2显示了最重要的数据结构之间的关系。现在看起来可能很混乱,但到本章结束时就会变得更有意义。

Figure 27-2 shows the relationships between the most important data structures. Right now it might seem a big mess, but it will make much more sense by the end of this chapter.

下面是图27-2所示的要点:

Here are the main points shown in Figure 27-2:

  • 在图的中央部分,您可以看到每个网络设备都有一个指向数据结构的指针,该数据结构保存设备上配置的每个 L3 协议的配置。如图所示,一台设备上配置了IPv6,两台设备上配置了IPv4。结构in_device(IPv4 配置)和inet6_dev结构(IPv6 配置)都包含指向其相邻协议(分别为 ARP 和 ND)使用的配置的指针。

    任何给定协议使用的所有neigh_parms结构都在一个单向列表中链接在一起,该列表的根存储在协议的neigh_table结构中。

  • In the central part of the figure, you can see that each network device has a pointer to a data structure that holds the configuration for each L3 protocol configured on the device. In the example shown in the figure, IPv6 is configured on one device and IPv4 is configured on both. Both the in_device structure (IPv4 configuration) and inet6_dev structure (IPv6 configuration) include a pointer to the configuration used by their neighboring protocols, respectively ARP and ND.

    All of the neigh_parms structures used by any given protocol are linked together in a unidirectional list whose root is stored in the protocol's neigh_table structure.

  • 该图的顶部和底部显示每个协议都保留两个哈希表。第一个是 hash_buckets，它缓存由协议解析或静态配置的 L3 到 L2 映射。第二个是 phash_bucket，它存储那些被代理的 IP 地址，如"每设备代理和每目标代理"一节中所述。请注意，phash_bucket 不是缓存，因此它的元素不会过期，也不需要确认。每个 pneigh_entry 结构

    数据结构的关系

    图 27-2。数据结构的关系

    包括一个指向其关联 net_device 结构的指针（图 27-2中未描绘）。图 27-6给出了有关 hash_buckets 缓存结构的更多细节。

  • The top and bottom of the figure show that each protocol keeps two hash tables. The first one, hash_buckets, caches the L3-to-L2 mappings resolved by the protocol or statically configured. The second one, phash_bucket, stores those IP addresses that are proxied, as described in the section "Per-Device Proxying and Per-Destination Proxying." Note that phash_bucket is not a cache, so its elements do not expire and don't need confirmation. Each pneigh_entry structure

    Figure 27-2. Data structures' relationships

    includes a pointer (not depicted in Figure 27-2) to its associated net_device structure. Figure 27-6 gives more detail on the structure of the cache hash_buckets.

  • 如果设备支持标头缓存，则每个 neighbour 实例都与一个或多个 hh_cache 结构相关联。"L2 标头缓存"一节以及图 27-1和27-10给出了有关 neighbour 和 hh_cache 结构之间关系的更多详细信息。

  • Each neighbour instance is associated with one or more hh_cache structures, if the device supports header caching. The section "L2 Header Caching," and Figures 27-1 and 27-10, give more details about the relationship between neighbour and hh_cache structures.

L3 协议和邻近协议之间的公共接口

Common Interface Between L3 Protocols and Neighboring Protocols

Linux 内核有一个通用的相邻层，它通过虚拟函数表（VFT）将 L3 协议连接到主 L2 传输函数（dev_queue_xmit）。VFT 是 Linux 内核中经常使用的机制，允许子系统在不同时间使用不同的函数。相邻子系统的 VFT 被实现为名为 neigh_ops 的数据结构。指向这些结构之一的指针作为名为 ops 的字段嵌入在每个 neighbour 结构中。

The Linux kernel has a generic neighboring layer that connects L3 protocols to the main L2 transmit function (dev_queue_xmit) via a virtual function table (VFT). A VFT is the mechanism frequently used in the Linux kernel for allowing subsystems to use different functions at different times. The VFT for the neighboring subsystem is implemented as a data structure named neigh_ops. A pointer to one of these structures is embedded as a field named ops in each neighbour structure.

VFT接口的灵活性允许不同的L3协议使用不同的邻居协议。这反过来允许不同的相邻协议表现得完全不同,同时允许相邻子系统在相邻协议和L3协议之间提供公共通用接口。

The flexibility of the VFT interface allows different L3 protocols to use different neighboring protocols. This in turn allows different neighboring protocols to behave quite differently while allowing the neighboring subsystem to provide a common generic interface between the neighboring protocols and the L3 protocols.

在本节中,我们将研究 L3 协议和相邻协议之间基于 VFT 的接口、使用 VFT 的优点、首次初始化时的优点,以及在邻居的生命周期中如何更新它。本节最后简要概述了用于控制 VFT 初始化的函数。为了更好地理解本节,请您首先阅读第 29 章中的“ neigh_ops 结构”部分。

In this section, we examine the VFT-based interface between the L3 protocols and the neighboring protocols, the advantages of using the VFT, when it is first initialized, and how it is updated during the lifetime of a neighbor. The section concludes with a brief overview of the functions used to control the initialization of the VFT. To better understand this section, you are invited to first read the section "neigh_ops Structure" in Chapter 29.

我们首先概述一下 VFT 中的例程是如何调用的。给定一个 neighbour 实例及其嵌入的 VFT neighbour->ops，理论上可以像这样直接调用其 output 字段所指向的函数：

Let's start with an overview of how the routines in the VFT are invoked. Given a neighbour instance and its embedded VFT neighbour->ops, the function to which the output field points could in theory be invoked directly like this:

    neigh->ops->output
    neigh->ops->output

但在 Linux 代码中并没有找到这种写法，因为即使这样也不够通用。neigh_ops 结构的 output 字段中的函数只是执行类似任务的四个函数之一，每个函数在 neigh_ops 中都有自己的字段。各个协议必须决定使用四个函数中的哪一个。正确的函数取决于事件、上下文以及接口和设备的配置。因此，为了使相邻基础设施与协议无关，neighbour 结构包含自己的 output 字段。各个协议将 neigh->ops 中某个字段所指的适当函数赋给 neigh->output。这使得代码更加简单和清晰。例如，不必这样做：

But this construct is not found in the Linux code because even this is not general enough. The function in the output field of the neigh_ops structure is only one of four functions that perform similar tasks, each function having its own field in neigh_ops. The individual protocol has to decide which of the four functions to use. The proper function depends on events, the context, and the configuration of the interface and device. So, to leave the neighboring infrastructure protocol-independent, the neighbour structure contains its own output field. The individual protocol assigns the proper function from one of the fields in neigh->ops to neigh->output. This allows the code to be simpler and clearer. For instance, instead of doing:

    if (neighbour is not reachable)
        neigh->ops->output(skb)
    else
    if (the device used to reach the neighbor can use cached headers)
      neigh->ops->hh_output(skb)
    else
      neigh->ops->connected_output(skb)
    if (neighbour is not reachable)
        neigh->ops->output(skb)
    else
    if (the device used to reach the neighbor can use cached headers)
      neigh->ops->hh_output(skb)
    else
      neigh->ops->connected_output(skb)

邻近的基础设施可以调用:

the neighboring infrastructure can just call:

      neigh->output
      neigh->output

只要 neigh->output 已由协议初始化为正确的 neigh_ops 方法即可。当然，每个相邻协议都使用自己的逻辑来初始化 neigh->output；它不一定必须遵循此代码片段中的规则。

as long as neigh->output has been initialized by the protocol to the right neigh_ops method. Of course, each neighboring protocol uses its own logic to initialize neigh->output; it does not necessarily have to follow the rules in this snapshot.
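To make the dispatch concrete, here is a self-contained sketch of the idea. The structure and function names are simplified stand-ins for the kernel's real neigh_ops/neighbour definitions, and the dummy methods return distinct tags only so the choice is observable:

```c
#include <stddef.h>

struct sk_buff;                        /* opaque for this sketch */
typedef int (*output_fn)(struct sk_buff *);

/* VFT: one slot per transmit strategy */
struct neigh_ops_sketch {
    output_fn output;                  /* slow path: resolution may be needed */
    output_fn connected_output;        /* neighbour known to be reachable */
    output_fn hh_output;               /* cached L2 header available */
};

struct neighbour_sketch {
    const struct neigh_ops_sketch *ops;
    output_fn output;                  /* what callers actually invoke */
};

/* Dummy methods returning distinct tags so we can see which was chosen */
static int slow_out(struct sk_buff *skb)      { (void)skb; return 1; }
static int connected_out(struct sk_buff *skb) { (void)skb; return 2; }
static int hh_out(struct sk_buff *skb)        { (void)skb; return 3; }

static const struct neigh_ops_sketch demo_ops = {
    .output = slow_out,
    .connected_output = connected_out,
    .hh_output = hh_out,
};

/* Protocol-side choice, made once; afterward callers just invoke
 * neigh->output(skb) without caring about the neighbour's state. */
static void pick_output(struct neighbour_sketch *neigh,
                        int reachable, int has_cached_header)
{
    if (!reachable)
        neigh->output = neigh->ops->output;
    else if (has_cached_header)
        neigh->output = neigh->ops->hh_output;
    else
        neigh->output = neigh->ops->connected_output;
}

static int demo_transmit(int reachable, int has_cached_header)
{
    struct neighbour_sketch neigh = { &demo_ops, NULL };
    pick_output(&neigh, reachable, has_cached_header);
    return neigh.output(NULL);
}
```

The point of the indirection is that the generic layer stays protocol-independent: only the protocol's own code ever touches the per-state choice.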

当邻居被创建时，它的 neighbour->ops 字段被初始化为正确的 neigh_ops 结构，如图27-3(a)所示。该赋值在邻居的整个生命周期内不会改变。然而，如图 27-3(b)所示，在 neighbour 结构的生命周期内，neigh->output 可以多次更改为不同的函数，既可以由协议运行期间发生的事件驱动，也可以（较少见地）由用户命令驱动。以下各节将详细介绍图 27-3中所示的两种初始化。

When a neighbor is created, its neighbour->ops field is initialized to the proper neigh_ops structure, as shown in Figure 27-3(a). This assignment does not change during the neighbor's lifetime. However, as depicted in Figure 27-3(b), neigh->output can be changed to different functions many times during the lifetime of the neighbor structure, driven both by events that take place during protocol operation, and (much less often) by user commands. The following sections will go into detail on both initializations shown in Figure 27-3.
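
The two-level dispatch described above can be sketched in miniature. This is a hypothetical model, not kernel code: the `mini_*` names and the reduced two-field ops table are invented for illustration; the real `neigh_ops` has more virtual functions and the switching is done by `neigh_connect`/`neigh_suspect`.

```c
#include <assert.h>
#include <stddef.h>

/* Miniature of the pattern described above: the ops table is chosen
 * once when the neighbor is created and never changes, while the
 * output pointer is re-pointed at one of the ops fields as protocol
 * events occur. */

struct sk_buff;                        /* opaque in this sketch */

typedef int (*output_fn)(struct sk_buff *skb);

static int fast_path(struct sk_buff *skb) { (void)skb; return 1; }
static int slow_path(struct sk_buff *skb) { (void)skb; return 2; }

struct mini_ops {
    output_fn connected_output;        /* used while reachability is fresh */
    output_fn output;                  /* used while reachability is suspect */
};

struct mini_neigh {
    const struct mini_ops *ops;        /* fixed for the neighbor's lifetime */
    output_fn output;                  /* switched by protocol events */
};

static const struct mini_ops demo_ops = { fast_path, slow_path };

/* ops is assigned once; output starts on the slower, checking path. */
static void mini_neigh_init(struct mini_neigh *n)
{
    n->ops = &demo_ops;
    n->output = n->ops->output;
}

/* A reachability confirmation re-points output without touching ops. */
static void mini_neigh_connect(struct mini_neigh *n)
{
    n->output = n->ops->connected_output;
}
```

The caller always invokes `n->output(skb)` and never needs to know which of the ops functions is currently behind it, which is exactly the simplification the text describes.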

Figure 27-3. (a) Initialization of neigh->ops; (b) initialization of neigh->output

Initialization of neigh->ops

On certain types of devices, the initialization of the functions listed in Figure 27-3(b) could be further optimized to speed up transmissions. These include, for instance, the situations described in the section "Special Cases" in Chapter 26, where there is no need to map an L3 address to an L2 address. In those cases, the neighboring subsystem can almost be bypassed altogether and only the queue_xmit function described in Chapter 11 is needed. The protocol code needs to know this kind of detail, but the general neighboring infrastructure does not, so the protocol can just initialize neigh->output to neigh->ops->queue_xmit and everything remains transparent to the upper layers. Simple!

For this reason, each protocol provides for three different instances of the neigh_ops VFT:

  • A generic table that can be used in any context (xxx _generic_ops). This is the one that is normally used to handle neighbors whose L2 addresses need to be resolved.

  • An optimized set of functions that can be used when the device driver provides its own set of functions to manipulate L2 headers and thus take advantage of the speedup coming from the use of cached headers (xxx _hh_ops).

  • A table that can be used when the device does not need to map L3 addresses to L2 addresses (xxx _direct_ops). An example is the use of ISDN with raw IP encapsulation.
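
The three-way choice above can be condensed into a small decision function. This is a sketch under stated assumptions: the `mini_*` names, the boolean device fields, and the tag field are hypothetical stand-ins for the real per-protocol tables such as `xxx_generic_ops`, `xxx_hh_ops`, and `xxx_direct_ops`.

```c
#include <assert.h>

/* Reduced model of the three neigh_ops instances: each table is
 * shrunk to a tag so the selection logic can be exercised. */

enum table_kind { GENERIC_OPS, HH_OPS, DIRECT_OPS };

struct mini_ops { enum table_kind kind; };

static const struct mini_ops generic_ops = { GENERIC_OPS };
static const struct mini_ops hh_ops      = { HH_OPS };
static const struct mini_ops direct_ops  = { DIRECT_OPS };

struct mini_dev {
    int needs_l2_resolution;   /* 0 for, e.g., ISDN with raw IP       */
    int has_header_cache_ops;  /* driver supplies L2 header callbacks */
};

/* Mirrors the decision sketched in the bullet list above. */
static const struct mini_ops *pick_ops(const struct mini_dev *dev)
{
    if (!dev->needs_l2_resolution)
        return &direct_ops;            /* no L3-to-L2 mapping needed */
    if (dev->has_header_cache_ops)
        return &hh_ops;                /* cached-header fast path    */
    return &generic_ops;               /* normal resolution case     */
}
```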

When the neighbor instance is created, the protocol initializes the neigh_ops VFT to the right instance depending on several factors. See the section "neigh_ops Structure" in Chapter 29.

In the specific case of IPv4/ARP, a fourth instance of neigh_ops called arp_broken_ops is used to initialize those neighbour instances associated with old devices that have not been adapted to the new neighboring infrastructure and therefore would not work otherwise. This once again shows how generic the neighboring infrastructure is: by initializing the neigh_ops VFT in the right way, the kernel is even able to use the old ARP code.

Initialization of neigh->output and neigh->nud_state

The state of a neighbor (neigh->nud_state) and the neigh->output function depend on each other. When nud_state changes, output often has to be updated accordingly. As a simple example, if the state becomes stale, confirmation of reachability is required. But the neighboring infrastructure doesn't waste time confirming reachability right away; there might be no further traffic and the effort might be wasted. Instead, the neighboring infrastructure stops using the optimized output function that blindly plugs in the current address, and switches to the slower output function that checks the address. In the example in Figure 27-3(a), we would change connected_output from c1 to o1.

For help in understanding this section, check Figure 26-13 in Chapter 26 for the possible states that neigh->nud_state can assume, based on device type and protocol events.

The neighboring subsystem provides a generic routine, neigh_update, that moves a neighbor to the state provided as an input argument. A later section in this chapter describes neigh_update in detail, but let's first look at the most common changes of state and the helper routines that can be called, either directly or via neigh_update, to take care of them.

Let's start with the most common case: a device that needs a neighboring protocol, an address that does not belong to any of the special cases described in Chapter 26, and a change of state caused by a transition (that is, we exclude creation and deletion).[*] Figure 26-12 in Chapter 26 can then be simplified to produce Figure 27-4. The figure also shows the kernel functions where the transitions are handled. However, not all of the transitions made by calls to neigh_update are shown, because most are too generic to add any value to the figure; only the transition triggered by the reception of a solicitation reply is shown.

Figure 27-4. Possible state transitions for a neighbor that has been resolved at least once

Note that some of the transitions in Figure 27-4 are asynchronous: they are taken care of by a timer and are therefore triggered by timestamp comparisons.[*] Other transitions are taken care of synchronously by the protocols (e.g., neigh_event_send).

Common state changes: neigh_connect and neigh_suspect

The main ways a neighbor can enter the NUD_REACHABLE state (all described in Chapter 26) are:

Reception of a solicitation reply

When a solicitation reply is received, either to resolve a mapping for the first time or to confirm a neighbor in the NUD_PROBE state, the protocol updates neigh->nud_state via neigh_update. This update is synchronous and happens right away.

L4 confirmation

The first time neigh_timer_handler is executed after the reception of an L4 reachability confirmation, the state is changed to NUD_REACHABLE (see the section "Reachability Confirmation" in Chapter 26). An L4 confirmation is asynchronous and may be slightly delayed.

Manual configuration

When a new neighbour structure is created by the user through a system administration command, this command can specify the state, and NUD_REACHABLE is a valid state. In this case, neigh_connect is invoked via neigh_update.

Whenever the NUD_REACHABLE state is entered, the neighboring infrastructure calls the neigh_connect function to make the neigh->output function point to neigh_ops->connected_output.

When a neighbor in the NUD_REACHABLE state moves to NUD_STALE or NUD_DELAY, or is simply initialized to a state different from one of the states in NUD_CONNECTED (for example, by a call to neigh_update), the kernel invokes neigh_suspect to enforce confirmation of reachability (see the section "Reachability Confirmation" in Chapter 26). neigh_suspect does this by setting neighbour->output to neigh_ops->output.

neigh_connectneigh_suspect更新链接到输入实例的所有结构的neighbour->output和函数(见图27-1)。然而,这两个函数都不会更新实例的 NUD 状态,因为这已经由它们的调用者处理了。在本章后面,我将使用“连接邻居”和“怀疑邻居”的形式来分别调用该邻居的和。neighbour->hh_outputhh_cacheneighbourneighbourneigh_connectneigh_suspect

Both neigh_connect and neigh_suspect also update the neighbour->output and neighbour->hh_output functions of all of the hh_cache structures linked to the input neighbour instance (see Figure 27-1). Neither function, however, updates the NUD state of a neighbour instance, because that is already taken care of by their callers. Later in this chapter I'll use the forms "connect the neighbor" and "suspect the neighbor" to refer to the invocation of neigh_connect and neigh_suspect, respectively, for that neighbor.

Some transitions (changes of NUD state) can happen at any time and more than once during the lifetime of a neighbour instance. Others can take place only once. With some knowledge of networking, it is not hard to look at Figure 26-13 in Chapter 26 and identify the transitions that belong to each of the two categories. For those neighbour instances initialized to permanent states (for instance, NUD_NOARP), neigh->output can be initialized to neigh_ops->connected right away and it will never change.

Routines used for neigh->output

As explained in the previous section, neigh->output is initialized by the neighbor's constructor function, and later is manipulated as a consequence of protocol events via the two routines neigh_connect and neigh_suspect. neigh->output is always set to one of the virtual functions of neigh_ops. This section lists the functions that can be assigned to the neigh_ops virtual functions. The dev_queue_xmit function, which is not really part of the neighboring subsystem, is defined in net/core/dev.c. The other routines are defined in net/core/neighbour.c.

dev_queue_xmit

The L3 layer always calls this function when transmitting a packet, regardless of the kind of device or L2 and L3 protocols used. A neighboring protocol initializes the function pointers of neigh_ops to dev_queue_xmit when all the information needed to transmit on the egress device is present and there is no extra work for the neighboring subsystem to do. If you look at arp_direct_ops in Chapter 28, you can see that all four transmission virtual functions are set to dev_queue_xmit. That function is described in Chapter 11.

neigh_connected_output

This function just fills in the L2 header and then calls neigh_ops->queue_xmit. Therefore, it expects the L2 address to be resolved. It is used by neighbour structures in the NUD_CONNECTED state.

neigh_resolve_output

This function resolves the L3 address to the L2 address before transmitting, so it is used when that association is not ready yet or needs to be confirmed. Except for the situations in the section "Special Cases" in Chapter 26, neigh_resolve_output is usually the default routine used when a new neighbour structure is created and its L3 address needs to be resolved.

neigh_compat_output

This function is present for backward compatibility. Before the neighboring infrastructure was introduced, it was possible to call dev_queue_xmit even if the L2 address was not ready yet.

neigh_blackhole

This function is used to handle the temporary case where a neighbour structure cannot be removed because someone is still holding a reference to it. neigh_blackhole discards any packet received in input. This is necessary to ensure that no attempt to transmit a packet to the neighbor will take place, because the neighbor's data structures are about to be removed. See the section "Neighbor Deletion."

The section "Initialization of a neighbour Structure" in Chapter 28 shows how ARP uses these functions to initialize the different instances of the neigh_ops VFT. The choices made by the functions are also shown in the flowchart in Figure 27-13.

Updating a Neighbor's Information: neigh_update

neigh_update, defined in net/core/neighbour.c, is a generic function that can be used to update the link layer address of a neighbour structure. This is its prototype, with a brief description of the input parameters:

    int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
                     u32 flags)
neigh

Pointer to the neighbour structure to update.

lladdr

New link layer (L2) address. lladdr may not always be initialized to a new value. For instance, when neigh_update is called to delete a neighbour structure (by setting its state to NUD_FAILED, as described in the section "Neighbor Deletion"), it is passed a NULL value for lladdr.

new

New NUD state.

flags

Used to convey information such as whether an existing link layer address can be overridden, etc. Here are the available flags, from include/net/neighbour.h:

NEIGH_UPDATE_F_ADMIN

Administrative change. This means the change derives from a user-space command (see the section "System Administration of Neighbors" in Chapter 29).

NEIGH_UPDATE_F_OVERRIDE

The current L2 address can be overridden by lladdr. Administrative changes use this flag to distinguish between replace and add commands, among other things (see Table 29-1 in Chapter 29). Protocol code can use this flag to enforce a minimum lifetime for an L2 address (see, for example, the section "Final Common Processing" in Chapter 28).

The next three flags are used only by IPv6 code:

NEIGH_UPDATE_ISROUTER

The neighbor is a router. This flag is used to initialize the IPv6 flag NTF_ROUTER in neighbour->flags.

NEIGH_UPDATE_F_OVERRIDE_ISROUTER

The IPv6 NTF_ROUTER flag can be overridden.

NEIGH_UPDATE_F_WEAK_OVERRIDE

If the link layer address lladdr supplied in input differs from the current known link layer address of the neighbor neigh->ha, the address is suspected (i.e., its state is moved to NUD_STALE so that reachability confirmation is triggered).
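
The weak-override behavior just described can be captured in a few lines. This is a hypothetical condensation: the `mini_neigh` structure, the `weak_override` helper, and the reduced state constants are invented for illustration (the real check lives inside `neigh_update`).

```c
#include <assert.h>
#include <string.h>

/* Sketch of NEIGH_UPDATE_F_WEAK_OVERRIDE semantics: a differing lladdr
 * does not replace the cached address, but it does demote the entry to
 * NUD_STALE so that reachability confirmation is triggered. */

#define NUD_REACHABLE 0x02
#define NUD_STALE     0x04

struct mini_neigh {
    unsigned char ha[6];   /* current link layer address */
    int state;
};

static void weak_override(struct mini_neigh *n, const unsigned char *lladdr)
{
    if (memcmp(n->ha, lladdr, sizeof(n->ha)) != 0)
        n->state = NUD_STALE;   /* suspect the entry; keep the old address */
}
```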

IPv6's ND protocol uses flags in the protocol header that can influence the setting of the NEIGH_UPDATE_F_XXX flags just listed. The discussion that follows skips over the parts of neigh_update that deal with the IPv6-only flags.

neigh_update is used by all of the administrative interfaces to change the link layer address of a neighbour structure, as shown in Figure 29-1 in Chapter 29. The function can also be used by the neighboring protocols themselves, but it is not the only function that changes state.

图 27-5(a)27-5(b)neigh_update显示了的内部结构的高级描述 。该流程图分为不同的区域,每个区域负责不同的任务:

Figures 27-5(a) and 27-5(b) show a high-level description of neigh_update's internals. The flowchart is divided into different areas, each area taking care of a different task:

  • Sanity checks

  • Changes applied to a neighbor whose current state is not NUD_VALID

  • Selection of the L2 address to use for a change applied to a neighbor whose current state is NUD_VALID

  • Setting a new link layer address

  • Change of NUD state

  • Handling an arp_queue queue

The following subsections explain the code in detail.

neigh_update optimization

Before changing the state of a neighbor, neigh_update first checks to see whether it is possible to avoid the change. An optimization discards the change of state if both of the following conditions are met (see (c)):

  • The link layer address has not been modified (that is, the input lladdr is the same as the current neigh->ha).

  • The new state is NUD_STALE and the current one is NUD_CONNECTED, which means that the current state is actually better than the new one.
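
The two discard conditions can be expressed as a single predicate. This is a sketch, not the kernel's code: the `update_is_noop` name and the reduced structure are hypothetical, and NUD_CONNECTED is collapsed to a single state here (in the kernel it is a derived state covering several NUD_XXX values).

```c
#include <assert.h>
#include <string.h>

/* Condensation of the early-exit check described above: skip the
 * downgrade to NUD_STALE when nothing has really changed. */

#define NUD_REACHABLE 0x02
#define NUD_STALE     0x04
#define NUD_CONNECTED NUD_REACHABLE   /* reduced: one connected state */

struct mini_neigh {
    unsigned char ha[6];   /* current link layer address */
    unsigned char state;
};

/* Returns 1 when the requested update may be discarded. */
static int update_is_noop(const struct mini_neigh *n,
                          const unsigned char *lladdr,
                          unsigned char new_state)
{
    return lladdr &&
           memcmp(n->ha, lladdr, sizeof(n->ha)) == 0 &&   /* same address */
           new_state == NUD_STALE &&
           (n->state & NUD_CONNECTED);                    /* current is better */
}
```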

Figure 27-5a. neigh_update function

Initial neigh_update operations

In this section, we trace the decisions made by neigh_update as it handles various values for the current state (neighbour->nud_state) and the requested state (the new parameter).

Figure 27-5b. neigh_update function

Only administrative commands (NEIGH_UPDATE_F_ADMIN) can change the state of a neighbor that is currently in the NUD_NOARP or NUD_PERMANENT state. A sanity check at the beginning of neigh_update causes it to exit right away if these constraints are violated.

When the new state new is not a valid one—if it is NUD_NONE or NUD_INCOMPLETE—the neighbor timer is stopped if it is running, and the entry is marked suspect (that is, requiring reachability confirmation) through neigh_suspect if the old state was NUD_CONNECTED. See the section "Initialization of neigh->output and neigh->nud_state." When the new state is a valid one, the neighbor timer is restarted if the new state requires it (NUD_IN_TIMER).

neigh_update被要求将 NUD 状态更改为与当前状态不同的值时(通常是这种情况),它需要检查状态是否从包含的值更改为另一个不包含的值NUD_VALIDNUD_VALID请记住,这NUD_VALID是派生状态,包括多个NUD_ XXX值)。特别是,当旧状态为 not NUD_VALID而新状态为 时NUD_VALID,主机必须传输在邻居arp_queue队列中等待的所有数据包。由于执行此操作时邻居的状态可能会发生变化(因为主机可能是对称多处理或 SMP 系统),因此在发送每个数据包之前会重新检查邻居的状态。

When neigh_update is asked to change the NUD state to a value different from the current one, which is normally the case, it needs to check whether the state is changing from a value included in NUD_VALID to another value not in NUD_VALID (remember that NUD_VALID is a derived state that includes multiple NUD_XXX values). In particular, when the old state was not NUD_VALID and the new one is NUD_VALID, the host has to transmit all of the packets that are waiting in the neighbor's arp_queue queue. Since the state of the neighbor could change while doing this (because the host may be a symmetric multiprocessing, or SMP, system), the state of the neighbor is rechecked before sending each packet.
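
The recheck-before-each-packet logic can be sketched as follows. Everything here is hypothetical scaffolding (the `mini_*` names, the boolean states, the `flaky_xmit` stub that imitates a concurrent downgrade); the real flush works on `skb`s parked in `arp_queue`.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of draining a neighbor's holding queue once the entry becomes
 * valid, rechecking the state before every packet as the text describes
 * (on SMP the state may change mid-flush). */

#define NUD_VALID  1
#define NUD_FAILED 0

struct mini_skb { struct mini_skb *next; };

struct mini_neigh {
    int state;
    struct mini_skb *queue;   /* packets parked while unresolved */
};

/* Returns the number of packets actually released. */
static int flush_queue(struct mini_neigh *n,
                       int (*xmit)(struct mini_neigh *, struct mini_skb *))
{
    int sent = 0;
    while (n->queue) {
        struct mini_skb *skb = n->queue;
        if (n->state != NUD_VALID)    /* recheck before every packet */
            break;
        n->queue = skb->next;
        sent += xmit(n, skb);
    }
    return sent;
}

/* Transmit stub that flips the state after the first packet,
 * imitating a concurrent downgrade by another CPU. */
static int flaky_xmit(struct mini_neigh *n, struct mini_skb *skb)
{
    (void)skb;
    n->state = NUD_FAILED;
    return 1;
}
```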

Changes of link layer address

The reason for calling neigh_update is to change the NUD state, but it can also change the destination link layer address by which a neighbor is reached. The function will do this if a new link layer address is provided (that is, if the lladdr parameter is not NULL) and if the input parameter flags allows it. When the link layer address is changed, all of the cached headers need to be updated accordingly. This is taken care of by neigh_update_hhs.

When no link layer address is supplied to neigh_update (i.e., lladdr is NULL), and the current NUD state is not a valid one, neigh_update discards the input frame skb and returns with an error (no change of state is applied if there is no valid link layer address for the neighbor).

Notifications to arpd

Some sites with large networks choose to manage ARP requests through a user-space daemon called arpd instead of making the kernel do it. When the kernel is compiled with support for arpd, and its use is configured (that is, app_probes > 0), neigh_update notifies the daemon about the following events: [*]

  • When a state is modified from NUD_VALID to a state that is not valid

  • When the link layer address is changed

General Tasks of the Neighboring Infrastructure

This section describes a few general concepts that you should be familiar with before delving into specific functions within the neighboring infrastructure: caching, reference counting, and timers.

Caching

The neighboring layer implements two kinds of caching:

Neighbor mappings

As with any other kind of data that can be used multiple times, it makes sense to cache the results of the L3-to-L2 mappings. Negative results (where an attempt to resolve the address failed) are not cached. But the neighbour structures associated with failed mappings are set to the NUD_FAILED state so that the garbage collection timer can clean them up (see the section "Garbage Collection").

L2 headers

The neighboring infrastructure caches L2 headers to speed up the time required to encapsulate an L3 packet into an L2 frame. Otherwise, the infrastructure would have to initialize each field of the L2 header one by one.

Because the caching of neighbor mappings is central to the operation of the neighboring subsystem, this section describes it in detail. (The later section "L2 Header Caching" describes L2 header caching.) The contents of a neighbour structure are described in the section "neighbour Structure" in Chapter 29, and the structure's creation and deletion are described in later sections in this chapter. Here we will stay at a higher level, describing how those structures are organized and accessed by the neighboring infrastructure.

The neighboring infrastructure places neighbour structures into caches, one per protocol, which are implemented as typical hash tables where elements that collide into the same bucket are linked into a singly linked list. New elements are added at the head of the lists (see the function neigh_create in the section "The neigh_create Function's Parameters"). The inputs to the hash function that distributes elements into buckets are the L3 address, the associated device, and a random value that is recomputed regularly to reduce the effectiveness of a hypothetical Denial of Service (DoS) attack. Figure 27-6 shows the structure of the cache. In Figure 27-2, you can see its relationship to other key data structures, such as the per-protocol neigh_table structure.

Hash tables are allocated and freed with neigh_hash_alloc and neigh_hash_free, respectively. Each hash table is created with a size of two elements at protocol initialization time (see neigh_table_init). When the number of elements in the table grows bigger than the number of buckets, the table is reorganized as follows. First, the size of the table is doubled (thus, the size of the hash table is always a power of 2).

Figure 27-6. neighbour's cache

The random value used for hashing is recalculated. Finally, the elements are redistributed throughout the table using the same previously mentioned variables: L3 address, device, and random number. This extension of the hash table is performed by neigh_hash_grow, which is called by neigh_create when necessary.
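
The growth policy described above amounts to a simple invariant. The following sketch models only that policy; the `mini_table` structure and its helpers are hypothetical, and real `neigh_hash_grow` also reallocates the bucket array, recomputes the hash seed, and rehashes existing entries.

```c
#include <assert.h>

/* Model of the hash table growth policy: the table starts tiny and
 * doubles whenever the element count exceeds the bucket count, so the
 * number of buckets is always a power of 2. */

struct mini_table {
    unsigned int buckets;   /* always a power of 2 */
    unsigned int entries;
};

static void mini_table_init(struct mini_table *t)
{
    t->buckets = 2;         /* initial size, as with neigh_table_init */
    t->entries = 0;
}

static void mini_table_add(struct mini_table *t)
{
    if (++t->entries > t->buckets)
        t->buckets *= 2;    /* the doubling neigh_hash_grow performs */
}
```

Because growth triggers as soon as entries outnumber buckets, the average chain length stays at or below one, which is why the text notes that each bucket rarely holds more than one or two structures.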

Note that extension of the hash table is easily triggered. Therefore, it rarely has more than one or two structures per bucket.

The maximum number of elements in a table is controlled by the gc_threshX variables described in the section "Garbage Collection." These limits are needed to prevent possible DoS attacks.

When the "neighboring system" needs to search a hash table for a neighbor, the search key is the destination L3 address (primary_key) together with the device (dev) through which the neighbor can be reached. Because different protocols may use keys of different lengths, the common lookup APIs need to take into account the key length. Therefore, the key length is stored in the neigh_table structure.

The main function used to query a neighbor protocol's cache is neigh_lookup. There are two others, both wrappers around neigh_lookup, that can either force the creation of a neighbour entry if the lookup fails or decide whether to create one according to an input parameter. Here is a brief description of the three routines:

neigh_lookup

Checks whether the element being searched for exists, and returns a pointer to it when successful.

    struct neighbour *neigh_lookup(struct neigh_table *tbl, const void *pkey,
                       struct net_device *dev)
    {
        struct neighbour *n;
        int key_len = tbl->key_len;
        u32 hash_val = tbl->hash(pkey, dev) & tbl->hash_mask;
        read_lock_bh(&tbl->lock);
        for (n = tbl->hash_buckets[hash_val]; n; n = n->next) {
            if (dev == n->dev &&
                !memcmp(n->primary_key, pkey, key_len)) {
                neigh_hold(n);
                NEIGH_CACHE_STAT_INC(tbl, hits);
                break;
            }
        }
        read_unlock_bh(&tbl->lock);
        return n;
    }
__neigh_lookup

A wrapper around neigh_lookup that creates the neighbour entry by means of neigh_create when the lookup fails and when __neigh_lookup was invoked with the creat input flag set.

__neigh_lookup_errno

Uses neigh_lookup to see whether the entry exists, and always creates a new neighbour instance when the lookup fails. This function is basically the same as __neigh_lookup without the input creat flag.
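
The lookup-plus-wrapper pattern behind these routines can be sketched in miniature. This is hypothetical code: the `mini_*` names, the integer key, and the single-bucket "cache" are stand-ins for the real hash table, key comparison, and reference counting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One-bucket model of the cache plus the lookup-or-create wrapper. */

struct mini_neigh { int key; struct mini_neigh *next; };

static struct mini_neigh *head;   /* single collision chain */

static struct mini_neigh *mini_lookup(int key)
{
    struct mini_neigh *n;
    for (n = head; n; n = n->next)
        if (n->key == key)
            return n;
    return NULL;
}

static struct mini_neigh *mini_create(int key)
{
    struct mini_neigh *n = malloc(sizeof(*n));
    n->key = key;
    n->next = head;               /* new entries go at the head */
    head = n;
    return n;
}

/* Analogue of __neigh_lookup: create only when the creat flag is set.
 * Dropping the flag and always creating on a miss would give the
 * __neigh_lookup_errno behavior. */
static struct mini_neigh *mini_lookup_creat(int key, int creat)
{
    struct mini_neigh *n = mini_lookup(key);
    if (!n && creat)
        n = mini_create(key);
    return n;
}
```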

Chapter 28 describes another function, arp_find, which is a wrapper around __neigh_lookup and is kept for backward compatibility, for use by legacy code. Another function, neigh_lookup_nodev, is currently used only by DECnet.

Each protocol also maintains a separate cache and an associated set of lookup APIs used for destination proxying. You can find more details about them in the section "Acting As a Proxy."

Timers

The neighboring subsystem uses several timers. Some are global, whereas others are created on a one-per-neighbor basis. Some run periodically, and others are started only when needed. The following is a brief overview of the timers we will see in more detail in later sections:

Transitions between states ( neighbour->timer )

Some transitions between NUD states are driven by the passage of time rather than by events in the system. These transitions include:

From NUD_REACHABLE to NUD_DELAY or NUD_STALE

This transition takes place when a certain amount of time goes by without sending or receiving traffic from a neighbor; the neighboring subsystem automatically suspects that the neighbor may not be reachable.

From NUD_DELAY to NUD_PROBE or NUD_REACHABLE

This is the next state after the neighbor's reachability is suspected; either it must be confirmed by an external event or the neighboring subsystem must launch an explicit probe. The timer simply detects the condition required to change state and takes care of it. For example, we saw in Figure 26-14 in Chapter 26 how neigh_confirm may be called when TCP provides confirmation of reachability. neigh_confirm updates a timestamp in the neighbour structure but does not change the state. Instead, when this timer detects the new timestamp, it changes the neighbor's state.

A timer in each neighbour structure controls both of these transitions. Its callback is initialized to neigh_timer_handler when the neighbour entry is created with neigh_alloc. You can find more information on this in Figure 27-4, and in the section "Reachability Confirmation" in Chapter 26.

Failed solicitation requests

If no answer to a solicitation request is received within a given amount of time, a new solicitation is sent. The maximum number of solicitation requests that can be sent is given by the XXX _probes fields of the neigh_parms structure, described in the section "neigh_parms Structure" in Chapter 29.

After the final failed attempt, the neighbor entry is moved to the NUD_FAILED state (see Figure 27-13). After the state becomes NUD_FAILED, it is up to the garbage collection timer to remove the entry.

Garbage collection ( neigh_table->gc_timer )

A periodic timer is used to make sure that no memory is wasted by unused data structures. The callback handler is neigh_periodic_timer. The section "Garbage Collection" describes the garbage collection mechanism in detail.

neigh_periodic_timer also updates the value of reachable_time in the neighbour structure to a random value[*] every 300 seconds. The value is random rather than fixed because you want to avoid having too many entries expiring at the same time: in a pretty big network, that could create a burst of traffic and CPU usage.
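
The randomization just described can be sketched in user space. This is an illustrative model, not the kernel code: the kernel helper that performs this computation (neigh_rand_reach_time) draws on the kernel's random source, for which rand() is a stand-in here, and the [base/2, 3*base/2) range reflects its arithmetic.

```c
#include <stdlib.h>

/* Sketch of deriving a randomized reachable_time from
 * base_reachable_time, so that entries created together do not all
 * expire at the same moment. The result is uniform in
 * [base/2, 3*base/2). */
static unsigned long rand_reach_time(unsigned long base)
{
    if (base == 0)
        return 0;
    return ((unsigned long)rand() % base) + (base >> 1);
}
```

A base of 30 seconds thus yields a value between 15 and 45 seconds, spreading expirations across the hosts of a network.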

Proxy ( neigh_table->proxy_timer )

For a proxy that might receive a large number of solicitation requests, it may be useful to delay the processing of requests. This timer is used to enforce the delay. See the section "Delayed Processing of Solicitation Requests."

Reference Counts on neighbour Structures

Many kernel subsystems involved in the creation of neighbors keep a reference to the neighbour structure in some data structure; the routing subsystem does so, for instance. Therefore, the neighbour structure includes a reference count named refcnt, which is incremented and decremented with neigh_hold and neigh_release, respectively.

The most common event that increments a neighbor reference count is a packet transmission. Whenever a packet is sent out, the associated sk_buff buffer holds a reference to a neighbour structure, so neighbour->refcnt is incremented to make sure that the transmission can complete without problems. Once the packet has been transmitted, the count is decremented again.
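The hold/release pattern can be modeled in user space with C11 atomics. This is a minimal sketch with hypothetical names (nbr_hold, nbr_release, transmit_packet), not the kernel's implementation; it only illustrates that the entry survives while a packet holds a reference.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* User-space model of the neigh_hold/neigh_release pattern. */
struct nbr {
    atomic_int refcnt;
    bool destroyed;
};

static void nbr_hold(struct nbr *n)
{
    atomic_fetch_add(&n->refcnt, 1);
}

static void nbr_release(struct nbr *n)
{
    /* Destroy only when the last reference is dropped. */
    if (atomic_fetch_sub(&n->refcnt, 1) == 1)
        n->destroyed = true;
}

static void transmit_packet(struct nbr *n)
{
    nbr_hold(n);        /* reference held on behalf of the in-flight packet */
    /* ... packet handed to the device here ... */
    nbr_release(n);     /* dropped once transmission completes */
}
```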

This was an example of a short-term reference; others can last significantly longer. One example is the reference kept by the routing table cache (under both IPv4 and IPv6[*]), as depicted in Figure 27-10.

The reference count is also incremented every time a per-neighbor timer is fired up, as shown in the following snapshot taken from neigh_update:

    if (new & NUD_IN_TIMER) {
            neigh_hold(neigh);
            neigh->timer.expires = jiffies +
                                   ((new & NUD_REACHABLE) ?
                                   neigh->parms->reachable_time : 0);
            add_timer(&neigh->timer);
    }

When an entry is to be removed for some reason (see neigh_ifdown in the section "Interactions with Other Subsystems") but it cannot be freed because someone still holds a reference to it, it is marked as dead with neighbour->dead set to 1. The garbage collection timer will soon take care of it, as explained in the section "Garbage Collection."

Creating a neighbour Entry

Like most cached items, the creation of neighbour entries is event driven: an instance is created when the system needs a neighbor and there is a cache miss. Specifically, a new instance is created when one of the following takes place:

Transmission request

When there is a transmission request toward a host whose L2 address is not known, the address needs to be resolved. This is the most common case and is depicted in Figure 27-13(a). When the target host is not directly connected to the sender, the L2 address to resolve will be that of the next hop gateway, not that of the target host.

Reception of a solicitation request

Because the host sending the request identifies itself in that request, the recipient automatically creates a cache entry on the assumption that communication between the two systems is imminent. (For details involving ARP, see Figure 28-2 in Chapter 28). However, information learned in this way (passively) is not considered as authoritative as information learned with an explicit solicitation request and reply (see the section "Transitions Between NUD States" in Chapter 26 for more details).

Manual coding

An administrator can create a cache entry through an ip neigh add command, as described in the section "System Administration of Neighbors" in Chapter 29.

When one of these events happens, and a query to the neighboring subsystem cache returns a miss, the neighboring protocol tries to resolve the association (normally by sending a solicitation request) and stores the resulting neighbour entry in the per-protocol cache.
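The lookup-then-create sequence can be condensed into a short sketch. Everything here is illustrative (a flat array in place of a hash table, cache_get in place of the neigh_lookup/neigh_create pair); it shows only the control flow: a hit returns the existing entry, a miss creates one whose L2 address is still unresolved.

```c
#include <stddef.h>
#include <stdbool.h>

struct entry {
    unsigned int l3_key;    /* search key: the L3 address */
    bool resolved;          /* L2 address known yet? */
};

#define CACHE_SIZE 8
static struct entry cache[CACHE_SIZE];
static int cache_used;

static struct entry *cache_lookup(unsigned int key)
{
    for (int i = 0; i < cache_used; i++)
        if (cache[i].l3_key == key)
            return &cache[i];
    return NULL;                    /* cache miss */
}

static struct entry *cache_get(unsigned int key)
{
    struct entry *e = cache_lookup(key);
    if (e)
        return e;                   /* hit: reuse the cached entry */
    if (cache_used == CACHE_SIZE)
        return NULL;                /* pool exhausted: would trigger forced GC */
    e = &cache[cache_used++];
    e->l3_key = key;
    e->resolved = false;            /* solicitation still pending */
    return e;
}
```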

The neigh_create Function's Parameters

Now that we know what triggers the creation of a neighbour structure, we can look at the main functions involved with its creation.

The data structure itself is created with neigh_create, whose return value is a pointer to the neighbour data structure. Here is the prototype and a description of the three input parameters:

    struct neighbour * neigh_create(struct neigh_table *tbl, const void *pkey,
                          struct net_device *dev)
tbl

Identifies the neighboring protocol used. The way this parameter is set is simple: if it is being called from IPv4 code (i.e., from arp_rcv) it is set to arp_tbl, etc.

pkey

L3 address. It is called pkey because it is the field that will be used as the search key for the cache lookup.

dev

Device the entry is associated with. Because each neighbour entry is associated with an L3 address and the latter is always associated with a device, it follows that a neighbour instance is always associated with a device.

New neighbour data structures are allocated with neigh_alloc, which is also used to initialize a few parameters such as the embedded timer, the reference count, a pointer to the associated neigh_table (neighboring protocol) structure, and global statistics about the number of neighbour structure allocations.

neigh_alloc uses a memory pool created at subsystem initialization time (see the section "Protocol Initialization and Cleanup"). The function fails only if the number of structures currently allocated is greater than some configurable threshold and, on top of that, an attempt by the garbage collector (via neigh_forced_gc) to free some memory failed (see the section "Synchronous cleanup: the neigh_forced_gc function").

pkey is copied into the data structure with the help of key_len, which provides the size of the data to be copied. This is necessary because the neighbour structures are used by protocol-independent cache lookup routines and the various neighboring protocols use addresses of different sizes.

        memcpy(n->primary_key, pkey, key_len);

Also, because the neighbour entry holds a reference to the net_device structure dev, the kernel increases the reference count on the latter with dev_hold to make sure the device will not be removed until the neighbour structure ceases to exist.

Neighbor Initialization

There are two kinds of initialization for a neighbour structure: one done by the neighboring protocol and one done by the device.

        if (tbl->constructor && (error = tbl->constructor(n)) < 0) {
            rc = ERR_PTR(error);
            goto out_neigh_release;
        }

The protocol's initialization is carried out by the neigh_table->constructor function invoked, as shown here, through the table passed in the function's tbl parameter. Chapter 28 explains how the ARP constructor does the job.

Device initialization is done through the neigh_setup virtual function:

        if (n->parms->neigh_setup &&
            (error = n->parms->neigh_setup(n)) < 0) {
            rc = ERR_PTR(error);
            goto out_neigh_release;
        }

This function is actually defined by only a few devices. For instance, the shaper virtual device (an old piece of code in drivers/net/shaper.c that has been rendered obsolete by the Traffic Control subsystem but is needed for backward compatibility) uses the setup function to make sure the device is associated with a specific instance of the neigh_ops structures provided by ARP (see the section "Initialization of neigh->ops"). Some WAN devices use a setup function for similar reasons.

The neigh_create function ends by setting the entry's confirmed field to indicate that the neighbor is reachable. Normally, this field is updated by a proof of reachability and is set to the current time expressed in jiffies. But here, at the point of creation, the function subtracts a small amount of time (twice the value of base_reachable_time, as the snippet shows) to make the state move to NUD_STALE a little faster than usual and to require proof of reachability.

        n->confirmed = jiffies - (n->parms->base_reachable_time<<1);

Once the entry has been initialized, it is added to the main cache using the hash function provided by the neighboring protocol.
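The insertion itself can be sketched as a head insert into a hash bucket. The modulo hash below is a placeholder for the protocol-supplied hash function, and the structures are simplified stand-ins for neighbour and its bucket lists, not kernel definitions.

```c
#include <stdlib.h>

struct node {
    unsigned int key;
    struct node *next;
};

#define NBUCKETS 4
static struct node *buckets[NBUCKETS];

static unsigned int hash_key(unsigned int key)
{
    return key % NBUCKETS;          /* stand-in for the protocol hash */
}

static struct node *bucket_insert(unsigned int key)
{
    struct node *n = malloc(sizeof(*n));
    unsigned int h = hash_key(key);

    n->key = key;
    n->next = buckets[h];           /* new entries go at the head, */
    buckets[h] = n;                 /* so recent entries are found first */
    return n;
}
```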

Neighbor Deletion

A neighbour data structure can be removed for three main reasons:

  • The kernel tries to send a packet to a host that is not reachable. There are many reasons this could happen: the host went down, its cable came unplugged, it was a wireless device that moved out of range, its network configuration got corrupted, or somebody manually created an entry for a nonexistent host. Whatever the cause, the neighboring subsystem notices the failure and puts the associated neighbour structure into the NUD_FAILED state so that it is cleaned up by asynchronous garbage collection, described in the section "Asynchronous cleanup: the neigh_periodic_timer function."

  • The host associated with the neighbor structure has changed its L2 address (perhaps because its NIC was replaced) but still has the same L3 configuration. Thus, the neighbour structure has an outdated L2 address. A host with an outdated neighbor entry has to put it into the NUD_FAILED state and create a new one.[*]

  • The structure gets old and the kernel needs its memory. It is therefore removed by garbage collection, described in the section "Synchronous cleanup: the neigh_forced_gc function."

The transition to NUD_FAILED is taken care of by the NUD algorithm introduced in the section "Transitions Between NUD States" in Chapter 26. Asynchronous garbage collection is performed by the neigh_periodic_timer function, which is associated with the neigh_table->gc_timer timer (see the sections "Timers" and "Garbage Collection" for more details).

A structure is removed only when its reference count goes to zero. Thus, the function that carries out the deletion, neigh_destroy, is called only from neigh_release, which is called every time a reference to a structure is released. neigh_release decrements the structure's reference count and calls neigh_destroy to actually remove the structure when the count goes down to zero:

    static inline void neigh_release(struct neighbour *neigh)
    {
            if (atomic_dec_and_test(&neigh->refcnt))
                    neigh_destroy(neigh);
    }

neigh_destroy carries out the following tasks:

  • Stops any pending timer. This is a belt-and-suspenders precaution. In theory, no timer should be pending when executing neigh_destroy because the condition required by neigh_release to invoke neigh_destroy is a reference count value of 0, and timers always hold a reference when running.

  • Releases any references to external data structures, such as the associated device and cached L2 headers. See Figures 27-1 and 27-10.

    The section "L2 Header Caching," later in this chapter, explains the purpose of the cache and shows the relationship between the neighbour structure and the hh_cache structures that contain the headers. Each hh_cache structure is strictly coupled with a neighbour entry and therefore should not be used once the neighbour entry has been removed or marked NUD_FAILED. Thus, when a neighbour entry is deleted, any hh_cache structures to which it refers are unlinked from the cache and freed if their reference counts allow it, and neigh_destroy sets the hh_cache->hh_output field in the cached header to neigh_blackhole (for that function, see the section "Routines used for neigh->output"). After this, any transmission attempt using the neighbour entry will silently fail and the packet will be dropped. At the L3 layer, the results of dropping the packet can be seen in the section "Interaction Between Neighboring Protocols and L3 Transmission Functions."

  • If a destructor method has been provided by the neighboring protocol, executes it to give the protocol a chance to do its own cleanup.

  • If the arp_queue queue is not empty, purges it (i.e., removes all of its elements). arp_queue is described in the section "Egress Queuing."

  • Decrements the global counter indicating the number of neighbour entries used by the host.

  • Frees the neighbour data structure (i.e., gives it back to its memory pool).

Garbage Collection

Garbage collection refers to the process of eliminating resources that are not in use anymore. Like many Linux kernel subsystems (networking and others), the neighboring subsystem maintains a timer that runs periodically and executes a function whenever the timer expires, to clean up the unused data structures.

The garbage collection algorithm used by the neighboring infrastructure has two main components:

Synchronous cleanup

This takes place immediately when the neighboring infrastructure needs to allocate a new neighbour structure and the memory pool for such structures is used up.

Asynchronous cleanup

This takes place periodically to remove neighbour structures that have not been used for a certain amount of time. This time is configurable and is stored in the gc_staletime variable. The neigh_periodic_timer function, described in the section "Timers," enforces this rule.

This relatively complex system was chosen because, in the case of the neighboring subsystem, the designers thought it would be more efficient than simpler designs such as deleting a structure the moment its reference count went down to zero. While the asynchronous cleanup tries to free structures that have no further value, the synchronous cleanup tries to sacrifice some of the less-needed entries to free some memory. Therefore, the criteria used to select the eligible structures are different in the two types of cleanup.

It is interesting to note that an asynchronous cleanup can be triggered by an external subsystem, too. For instance, when the routing subsystem cannot insert a new routing entry into its cache, it tries to remove unused cache entries (see the description of the rt_intern_hash function in Chapter 33), which indirectly causes neighbour structures to be freed, too.

The parameters that tune garbage collection behavior are:

  • From neigh_table:

    • gc_interval

    • gc_thresh1, gc_thresh2, gc_thresh3

    • last_flush

    • gc_timer

  • From neigh_parms:

    • gc_staletime

The following two sections explain their meaning and use. Also consult the section "neigh_table structure," the section "neigh_parms structure," and Table 29-3 in Chapter 29 for information on these variables.

Figures 27-7 and 27-8 show the behavior of neigh_periodic_timer and neigh_forced_gc, the two routines described in the next two sections.

Synchronous cleanup: the neigh_forced_gc function

Figure 27-7 shows the internals of neigh_forced_gc.

Figure 27-7. neigh_forced_gc function

If there is no memory to allocate a new neighbour instance, the host cannot transmit any packet to neighbors for which there is not already a neighbour structure in the cache. Without a policy to handle this case, the consequences would be pretty bad: no communication could take place with a new host until another neighbour structure happened to be removed for some reason.

The neigh_alloc function, which we have seen is responsible for allocating memory in the neighboring subsystem, is the natural place to kick off synchronous garbage collection. To detect a dangerous situation and do garbage collection before memory is actually exhausted, neigh_alloc checks two variables named gc_thresh2 and gc_thresh3. (Another variable, gc_thresh1, is currently declared in the kernel but is not used.)

When the number of neighbour instances is greater than gc_thresh3, the neigh_alloc function forces garbage collection. When the number of instances is between gc_thresh2 and gc_thresh3, garbage collection is forced if the previous garbage collection took place at least 5 seconds earlier. The reason for the second check is to rate limit the time spent doing garbage collection.

The default values for gc_thresh2 and gc_thresh3 are 512 and 1,024, respectively. These look like big numbers, but are designed to support proxy ARP. Without a proxy ARP server, each host usually creates ARP entries for only a few local machines and the router, so it would never get near those thresholds. But when proxy ARP is in use, hosts request more L3 addresses because they rely less on the default gateway. The reception of a solicitation request by the proxy ARP server leads to the indirect creation of a neighbour entry for the sender's address. See the earlier section "Creating a neighbour Entry," and the description of arp_process in Chapter 28. In a medium-size network, the thresholds are pretty safe and the cache is not likely to overflow.
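The allocation-time decision described above can be condensed into a predicate. This sketch uses seconds instead of jiffies and hardcodes the default thresholds; the function name is illustrative, not a kernel API.

```c
#include <stdbool.h>

#define GC_THRESH2 512
#define GC_THRESH3 1024
#define GC_MIN_INTERVAL 5           /* seconds between forced collections */

/* Returns true when neigh_alloc-style logic should force garbage
 * collection before allocating a new entry. */
static bool should_force_gc(int entries, long now, long last_flush)
{
    if (entries > GC_THRESH3)
        return true;                /* hard limit: always collect */
    if (entries > GC_THRESH2 && now - last_flush >= GC_MIN_INTERVAL)
        return true;                /* soft limit: rate-limited collection */
    return false;
}
```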

The routine invoked to do synchronous cleanup is neigh_forced_gc, which is depicted in Figure 27-7. neigh_forced_gc removes all of the eligible elements from the hash table. Eligible elements are the ones that meet both of the following requirements:

  • The reference count is 1, meaning that nobody is using the element, and the subsystem holding the remaining reference is free to delete the element.

  • The element is not in the NUD_PERMANENT state. Elements in that state have been statically configured and therefore do not expire.

Elements are added by neigh_create at the head of the bucket's lists in the hash table.

Asynchronous cleanup: the neigh_periodic_timer function

Figure 27-8 shows the internals of neigh_periodic_timer.

Figure 27-8. neigh_periodic_timer function

gc_timer is a per-protocol timer that expires periodically. When the timer expires, it invokes the garbage collection routine neigh_periodic_timer. The kernel actually invokes a function specified in a field of the neigh_table structure (one of which exists for each neighboring protocol), so each protocol could theoretically have its own implementation of the garbage collection handler, but in practice the field is initialized to the same routine across all the protocols in the neigh_table_init function.

How often gc_timer expires depends on the size of the hash_buckets table: because neigh_periodic_timer scans only one bucket of the table every time it is called, and because the whole table is scanned (by design choice) once every base_reachable_time/2 seconds, it follows that the timer must be set to expire every (base_reachable_time/2)/number_of_buckets.
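As a worked example of the formula, assume a base_reachable_time of 30 seconds and 32 hash buckets (both values illustrative):

```c
/* (base_reachable_time / 2) / number_of_buckets, in milliseconds. */
static long gc_timer_interval_ms(long base_reachable_time_ms, int nbuckets)
{
    return (base_reachable_time_ms / 2) / nbuckets;
}
```

With those numbers the timer fires every (30000/2)/32 = 468 ms (integer division), and after 32 expirations — about 15 seconds — the whole table has been scanned once.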

Every time neigh_periodic_timer is called, it remembers the last bucket scanned, thanks to the neigh_table field hash_chain_gc, and scans the following one.

The neigh->confirmed timestamp is updated every time the reachability of the neighbor is confirmed, for example, by calling neigh_confirm, as we saw in the section "Reachability Confirmation" in Chapter 26. Even though its name suggests it, the neigh->used timestamp is not updated every time the neighbour structure is used (i.e., with the transmission of each packet to the neighbor). Because of this, it is possible that at some point, neigh->confirmed represents a more updated timestamp marking the last use of the neighbour structure. For this reason, neigh_periodic_timer updates neigh->used if that is needed (i.e., if neigh->confirmed is greater than neigh->used). It is important to keep neigh->used updated because that's the timestamp used by neigh_periodic_timer to eliminate old entries.
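The fix-up itself is a one-line comparison, sketched here with plain longs standing in for jiffies timestamps (the structure below is a simplification, not the kernel's neighbour):

```c
struct nbr_times {
    long used;          /* last recorded use */
    long confirmed;     /* last proof of reachability */
};

/* Bump "used" up to "confirmed" when confirmation happened more
 * recently, since the stale-entry check consults "used". */
static void refresh_used(struct nbr_times *t)
{
    if (t->confirmed > t->used)
        t->used = t->confirmed;
}
```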

As Figure 27-8 shows, eligible elements marked for deletion by neigh_periodic_timer meet both of the following criteria:

  • The reference count is 1, meaning it is no longer used.

  • The entry either is in the NUD_FAILED state, which means that resolution failed, or has simply not been used for more than the configurable gc_staletime time.
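The two sets of criteria — synchronous cleanup's (a reference count of 1 and not NUD_PERMANENT) and the periodic timer's just listed — can be written as predicates. The enum and parameters below are simplifications for illustration, not the kernel's definitions.

```c
#include <stdbool.h>

enum nud_state { NUD_REACHABLE, NUD_FAILED, NUD_PERMANENT };

/* Eligible for synchronous (forced) cleanup. */
static bool forced_gc_eligible(int refcnt, enum nud_state state)
{
    return refcnt == 1 && state != NUD_PERMANENT;
}

/* Eligible for asynchronous (periodic) cleanup. */
static bool periodic_gc_eligible(int refcnt, enum nud_state state,
                                 long now, long used, long gc_staletime)
{
    if (refcnt != 1)
        return false;               /* still referenced: leave it alone */
    return state == NUD_FAILED || now - used > gc_staletime;
}
```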

Acting As a Proxy

The section "Proxying the Neighboring Protocol" in Chapter 26 described why proxies are useful and gave a few examples of their use. It also showed the criteria by which neighboring protocols decide whether a given solicitation request is taken care of by the proxy. This section goes into detail on the implementation of proxying.

We saw in the section "Conditions Required by the Proxy" in Chapter 26 that two kinds of proxying can be configured: a host either can proxy all requests received on a particular NIC (per-device proxying) or, more selectively, can proxy requests for a particular address received on a particular NIC (per-destination proxying).

The precedence shown in Figure 26-8 in Chapter 26 is enforced in protocol-specific code. ARP's implementation is shown in Chapter 28, and you can look at the routine ndisc_recv_ns for IPv6's implementation. The section "Per-Device Proxying and Per-Destination Proxying" also goes into more detail about these two types of proxying.

Before digging into the code, let me introduce a naming convention used extensively there. The neighboring subsystem contains pairs of functions and data structures whose names differ only in the presence or absence of an initial p (e.g., neigh_lookup versus pneigh_lookup). The p stands for proxy. Because addresses intercepted by proxies are handled differently, there is a dedicated set of functions to manipulate them.

Delayed Processing of Solicitation Requests

Solicitation requests handled by the proxy can be processed right away or after a configurable delay. The main reason for introducing a delay is to give proxy entries lower priority than more authoritative hosts, such as the real owners of the solicited L3 addresses. A host that sends a request locks the first reply for a small amount of time and waits in case another arrives, to enforce the priority; details are described in the section "Final Common Processing" in Chapter 28.

The delay applied is a random value between 0 and the configured value proxy_delay (see the function pneigh_enqueue). The use of a random value reduces the likelihood of synchronized requests by multiple hosts, and the congestion that could result. For example, if a power failure occurs at a site, and upon recovery it powers up hundreds of hosts at the same time, all of the hosts probably solicit the same set of servers or default gateways. A random delay smoothes out the spike in traffic that would result.
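
The jitter logic is easy to model in a few lines of C. This sketch is not kernel code: pick_proxy_delay is an invented name, and the C library's rand stands in for the kernel's net_random used by pneigh_enqueue.

```c
#include <stdlib.h>

/* Hypothetical helper mirroring the delay computation in
 * pneigh_enqueue: pick a delay uniformly in [0, proxy_delay)
 * "jiffies". A configured proxy_delay of 0 means the request
 * is processed immediately, with no buffering at all. */
static unsigned long pick_proxy_delay(unsigned long proxy_delay)
{
    if (proxy_delay == 0)
        return 0;                     /* no buffering configured */
    return (unsigned long)rand() % proxy_delay;
}
```

Because each host draws its own random value, a burst of simultaneous solicitations after a site-wide power-up is spread over the whole [0, proxy_delay) window instead of arriving as one spike.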

To apply a delay, the neighboring subsystem creates a queue storing ingress solicitation requests, and a timer. The timer expires after the configured delay has passed and triggers the execution of a special handler that dequeues the elements from the queue. They are then processed as if they had just been received from the network.

Figure 27-9 depicts the model just described.

The major variables and virtual functions involved in handling the proxy delay are:

  • From neigh_table (per-protocol parameters)

    proxy_queue

    Queue where the ingress solicitation requests are temporarily buffered. Elements are added to the end of the list. When the proxy_queue list has reached the maximum length specified in proxy_qlen (discussed later), new elements are dropped; they do not replace the oldest ones.

    proxy_timer

    Timer used to enforce the delay. The timer is initialized by neigh_table_init and the default handler is neigh_proxy_process.

    proxy_redo

    Function that processes the dequeued requests. As shown in Figure 27-9, it consists of just a call to the same function that processes freshly received packets.

    Figure 27-9. Generic model of a protocol proxy handler

  • From neigh_parms (per-device parameters)

    proxy_delay

    Configurable delay used to load the timer.

    proxy_qlen

    Maximum length of the temporary storage queue.

For a more detailed, field-by-field description of the most important data structures, refer to Chapter 29.

For each protocol, there is a private queue (neigh_table->proxy_queue) shared by all the NICs using that protocol. New elements are added to proxy_queue with pneigh_enqueue. proxy_queue is a doubly linked list that is kept sorted to make it easier for the timer to handle the pending requests in chronological order. If proxy_queue already contains any requests when pneigh_enqueue adds a new one, the function restarts the timer to make it expire either at the currently scheduled timeout or at the timeout required by the new element, whichever comes first.
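
The queueing discipline just described can be sketched with a toy singly linked list (invented types, far simpler than the kernel's sk_buff handling): insertions keep the list sorted by expiry, and the timer is always re-armed to the head's timeout, which is the earliest one.

```c
#include <stdlib.h>

/* Toy model (not kernel code) of proxy_queue ordering: requests
 * are kept sorted by expiry time so the timer handler can process
 * them chronologically, and the timer is re-armed to whichever
 * timeout comes first. */
struct req {
    unsigned long expires;   /* absolute time, in "jiffies" */
    struct req *next;
};

/* Insert in ascending order of expiry; returns the new head. */
static struct req *enqueue_sorted(struct req *head, struct req *r)
{
    struct req **p = &head;
    while (*p && (*p)->expires <= r->expires)
        p = &(*p)->next;
    r->next = *p;
    *p = r;
    return head;
}

/* The timer must fire at the earliest pending expiry: the head. */
static unsigned long next_timeout(const struct req *head)
{
    return head ? head->expires : 0;
}
```

With the list sorted, the handler that runs on expiry only has to pop entries from the head until it finds one whose timeout is still in the future.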

Linux uses the same routine to process both new solicitation requests received from network devices and solicitation requests dequeued from the proxy queue, as shown in Figure 27-9 (both for IPv4 and IPv6). Because of this, the routine needs to be able to distinguish between the following two categories of packets:

  • Packets that have just been received and that therefore need to be queued to proxy_queue

  • Packets that have been dequeued from the proxy queue after a delay

Linux distinguishes between these two cases by using a special value for one field of the sk_buff buffer structure: skb->stamp.tv_sec. The field is a timestamp that is initialized to the local receive time by netif_rx (see Chapter 10) when a packet is first received. The neighboring protocol handlers are called after netif_rx (see Chapter 13) and therefore the value of skb->stamp.tv_sec is normally non-negative when accessed within the protocol handlers. However, when an entry is queued to proxy_queue, its stamp.tv_sec is initialized to LOCALLY_ENQUEUED, which is equivalent to the value -2. Thus, when the packet is dequeued by the proxy timer and is passed to the neighboring protocol handler, the handler will know the buffer comes from the proxy queue, as shown in Figure 27-9.
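
The sentinel test reduces to a one-line check. The structure below is an invented stand-in for sk_buff, keeping only the timestamp field that matters here:

```c
#include <stdbool.h>

/* Illustrative only: a stand-in for the LOCALLY_ENQUEUED sentinel
 * (the value -2 stored in skb->stamp.tv_sec), which lets a single
 * handler tell a freshly received packet from one replayed off
 * proxy_queue by the proxy timer. */
#define LOCALLY_ENQUEUED (-2)

struct fake_skb {
    long stamp_tv_sec;   /* receive time, or the sentinel */
};

static bool from_proxy_queue(const struct fake_skb *skb)
{
    return skb->stamp_tv_sec == LOCALLY_ENQUEUED;
}
```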

When proxy_delay is 0, no buffering is used and requests are processed right away. When proxy_delay is nonzero, requests are queued into proxy_queue. As explained earlier, a random value from 0 to proxy_delay is introduced into the delay to prevent a flood of simultaneous requests from different hosts.[*] Because of this, entries may not be processed in the same order in which they are received, but that is not a problem.

Per-Device Proxying and Per-Destination Proxying

When proxying is globally enabled on a device, state information is simple: the device just needs to be associated with a flag that says whether proxying is enabled. Per-destination proxying, on the other hand, needs to store the proxied addresses. These L3 addresses for which the host should intercept solicitation requests are stored in the neigh_table->phash_buckets hash table (see Figure 27-2), which can be searched with pneigh_lookup, the proxying counterpart of neigh_lookup.
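
The lookup order the two modes imply can be illustrated with made-up types; the linear scan below plays the role of the phash_buckets hash table searched by pneigh_lookup:

```c
#include <stdbool.h>
#include <string.h>

/* Sketch of the two proxying modes, with invented types: a
 * per-device flag answers for every address on the NIC, while
 * per-destination proxying consults a table of explicitly
 * configured L3 addresses (the role of phash_buckets). */
struct fake_dev {
    bool proxy_all;            /* per-device proxying enabled */
    const char *proxied[4];    /* stand-in for the hash table */
    int nproxied;
};

static bool should_proxy(const struct fake_dev *dev, const char *addr)
{
    if (dev->proxy_all)
        return true;           /* per-device: answer for anything */
    for (int i = 0; i < dev->nproxied; i++)
        if (strcmp(dev->proxied[i], addr) == 0)
            return true;       /* per-destination hit */
    return false;
}
```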

Like neigh_lookup, which is described in the section "Caching," pneigh_lookup accepts an input parameter that can be used to force the creation of a neighbour structure if the search fails. Unlike hash_buckets, phash_buckets does not have a maximum size. Furthermore, there is no garbage collection because it would not make sense given the different nature of its elements: these addresses are explicitly configured to be proxied, so they remain valid until they are explicitly configured not to be proxied anymore. In IPv4, these addresses can be configured only manually. In IPv6, these addresses can also be configured by the protocol under certain conditions.

New entries can be added to the table dynamically by the neighboring protocols or statically by an administrative command (see the section "System Administration of Neighbors" in Chapter 29). Entries can be removed with pneigh_delete.

L2 Header Caching

L2 headers tend to be the same on all packets sent from one host to another. This is in contrast to L3 headers, which usually have different IDs, different fragment offsets when fragmentation occurs, and other ways of changing from one packet to the next. Therefore, the kernel doesn't bother caching L3 headers, but it does cache L2 headers. Complex L2 protocols may not have consistent headers, but the most common ones, such as Ethernet, do. (See Chapter 13 for more details on Ethernet.) When caching is used, the device driver of the egress device has to support it.

After sending the first packet to a given destination, a driver saves the L2 header in a dedicated structure named hh_cache. The next time a packet is sent to the same neighbor, the sender does not need to fill in the L2 header field by field, but simply copy it in one shot from the cache. The relationship of hh_cache to other neighboring protocol structures was introduced earlier in the section "Main Data Structures," and in Figure 27-2 in that section. The structure is described in more detail in the section "hh_cache Structure" in Chapter 29.
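
A rough illustration of the payoff, with a simplified structure in place of the real hh_cache: the header is assembled field by field once, and every later transmission is a single memcpy. The constants are Ethernet's.

```c
#include <string.h>

/* Toy illustration of why cached L2 headers pay off. The struct is
 * a simplification of hh_cache; only the cached bytes and their
 * length are kept. */
#define ETH_ALEN 6
#define ETH_HLEN 14

struct toy_hh_cache {
    unsigned char hh_data[ETH_HLEN];
    int hh_len;
};

/* Slow path: build the Ethernet header once, field by field. */
static void build_eth_header(struct toy_hh_cache *hh,
                             const unsigned char *dst,
                             const unsigned char *src,
                             unsigned short proto)
{
    memcpy(hh->hh_data, dst, ETH_ALEN);
    memcpy(hh->hh_data + ETH_ALEN, src, ETH_ALEN);
    hh->hh_data[12] = proto >> 8;      /* EtherType, big endian */
    hh->hh_data[13] = proto & 0xff;
    hh->hh_len = ETH_HLEN;
}

/* Fast path: prepend the cached bytes in one shot. */
static void copy_cached_header(unsigned char *frame,
                               const struct toy_hh_cache *hh)
{
    memcpy(frame, hh->hh_data, hh->hh_len);
}
```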

Header caching at the L2 layer is tied to caching by the routing subsystem at the L3 layer, described in Chapter 33. As shown in Figure 27-1, each dst_entry element of the IPv4 routing cache includes a pointer to the neighbour structure associated with the next hop, and that entry includes a list of hh_cache cached headers. While multiple headers could be cached for each neighbor, usually only one is cached. Figure 27-1 shows the case of an Ethernet header.[*]

The relationship between the data structures in different caches is shown in action in Figure 27-10. The figure shows a simple scenario: two LANs connected by a router. From Host C's perspective, Host A, Host B and Router are reachable via the next hop Router. If Host C had exchanged some data with each of those three hosts, its routing cache would have a dst_entry (routing cache element) for each one. As explained previously, each dst_entry has a link to the neighbour structure of the associated next hop, which in this case is Router. Note that both the neighbour and the dst_entry structures have a link to the hh_cache entry. Note also that one cached header is sufficient because all three hosts (Host A, Host B, and Router) are reachable via the same next hop.

Figure 27-10. Example of caches used with routing

The reference count on the hh_cache structure (hh_refcnt) in Figure 27-10 is 4, which is the number of dotted links. Both the references held by the dst_entry structures and the reference held by the neighbour structure are set via neigh_hh_init, as described in the section "Link Between Routing and L2 Header Caching." Reference counts on hh_cache structures are incremented via direct calls to atomic_inc. As shown in the section "Reference Counts on Neighbour Structures," the kernel provides a special wrapper for neighbour structures.

The use of L2 header caching is transparent to L3 protocols, as shown in the later section, "Interaction Between Neighboring Protocols and L3 Transmission Functions."

Methods Provided by the Device Driver

For L2 caching to be used, the device driver has to cooperate by providing a routine that stores the L2 header in an hh_cache structure. In Chapter 2, I described the methods or virtual functions in the net_device data structure. It is worthwhile reviewing some of those methods now in light of the knowledge you have developed from reading this chapter. We will take the ones defined for Ethernet devices as examples for this section; these methods are initialized in ether_setup (see Chapter 8).

hard_header

Fills in the L2 header field by field. When the device does not use any L2 header (see the section "Special Cases" in Chapter 26), this method is initialized to NULL. The neighbor's constructor method checks hard_header to select the right neigh_ops method from the virtual table; a NULL entry is treated specially. See the ARP example in the section "Start of the arp_constructor Function" in Chapter 28. ndisc_constructor acts similarly for the ND protocol.

hard_header is used when header caching is not supported by the device driver (as in neigh_connected_output), or when the header is not ready yet and therefore is not present in the cache (as in neigh_resolve_output). When invoked, hard_header usually receives an skb buffer in input. The skb->data field points to the beginning of the L3 header. hard_header uses skb_push to make the space needed to prepend the L2 header.

hard_header_cache

Caches an L2 header in an hh_cache structure. This is done, of course, only the first time a packet is sent to a neighbor, and only when all of the header's fields are ready (for instance, not before address resolution has completed).

header_cache_update

Updates an existing hh_cache entry by replacing its cached header with a new one. This function is usually called from within neigh_update_hhs, which is used by neigh_update to update a neighbor entry (see the section "Updating a Neighbor's Information: neigh_update").

hard_header_parse

Retrieves the source L2 address from a buffer and returns its length.

rebuild_header

Deprecated and kept only for backward compatibility with pre-2.2 kernel device drivers. Devices using this function cannot use the cached resolved address in dst_entry->neigh.
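
The pointer arithmetic behind hard_header's use of skb_push can be modeled with a toy buffer (invented types, not the real sk_buff): the packet starts with headroom, data points at the L3 header, and pushing moves data back to open room for the L2 header.

```c
#include <assert.h>

/* Minimal model of what skb_push does for hard_header. The real
 * sk_buff is far richer; only the fields needed here are kept. */
struct toy_skb {
    unsigned char buf[64];
    unsigned char *data;   /* current start of packet data */
    unsigned int len;
};

static void toy_skb_init(struct toy_skb *skb, unsigned int headroom)
{
    skb->data = skb->buf + headroom;  /* reserve space for lower layers */
    skb->len = 0;
}

/* Move skb->data back by n bytes and return the new start, where
 * the caller then writes the L2 header. */
static unsigned char *toy_skb_push(struct toy_skb *skb, unsigned int n)
{
    assert(skb->data - skb->buf >= (long)n);  /* enough headroom? */
    skb->data -= n;
    skb->len += n;
    return skb->data;
}
```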

Link Between Routing and L2 Header Caching

When a neighbor entry has just been created, neigh->output points to neigh_resolve_output, which is in charge of associating the neighbor with the L2 header. Thus, transmitting functions at the L3 layer (described in Chapter 21 and in the section "Interaction Between Neighboring Protocols and L3 Transmission Functions" in this chapter) transparently trigger address resolution.

Here is a snapshot from neigh_resolve_output, where dst is the routing table cache entry briefly introduced in the section "Main Data Structures":

         if (dev->hard_header_cache && !dst->hh) {
                 write_lock_bh(&neigh->lock);
                 if (!dst->hh)
                         neigh_hh_init(neigh, dst, dst->ops->protocol);
                 err = dev->hard_header(skb, dev, ntohs(skb->protocol),
                                        neigh->ha, NULL, skb->len);
                 write_unlock_bh(&neigh->lock);
         } else {
                 read_lock_bh(&neigh->lock);
                 err = dev->hard_header(skb, dev, ntohs(skb->protocol),
                                        neigh->ha, NULL, skb->len);
                 read_unlock_bh(&neigh->lock);
         }

If the device can use header caching (that is, hard_header_cache is set) but the header has not been cached yet (!dst->hh), neigh_resolve_output has to initialize and cache the L2 header. It does so by calling the neigh_hh_init function, which creates the hh_cache entry and links it to the dst->hh routing table cache entry (the operation shown by dotted lines in Figure 27-10).

If, instead, caching is not supported by the device, the L2 header is filled in with hard_header.

In both cases, the neighbour structure is accessed under the protection of a lock. But the first case accesses the structure in exclusive mode to write the header; the second accesses it in shared mode. Note that in the first case, neigh_resolve_output checks the status of dst->hh once more after having acquired the lock. This is a standard way to avoid a race condition with locks; it is done in this case because dst->hh may have been initialized by another CPU between the previous check and the acquisition of the lock.
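
The same check-lock-recheck idiom can be written against pthreads instead of the kernel's read/write locks; cached_hh stands in for dst->hh and init_hh for neigh_hh_init, both invented for the sketch.

```c
#include <pthread.h>

/* Double-checked initialization, sketched in userspace. The names
 * are illustrative stand-ins, not kernel API. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static void *cached_hh;          /* stands in for dst->hh */
static int init_calls;           /* counts real initializations */

static void *init_hh(void)
{
    static int dummy;
    init_calls++;                /* how often we actually init */
    return &dummy;
}

static void *get_hh(void)
{
    if (!cached_hh) {                    /* cheap unlocked check */
        pthread_mutex_lock(&lock);
        if (!cached_hh)                  /* recheck under the lock: */
            cached_hh = init_hh();       /* another CPU may have won */
        pthread_mutex_unlock(&lock);
    }
    return cached_hh;
}
```

The unlocked first check keeps the common case cheap; the recheck under the lock is what prevents two CPUs from both initializing the entry.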

For the internals of neigh_resolve_output, see Figure 27-13.

Cache Invalidation and Updating

A cached header may include many different fields, but the two that are most likely to change and therefore invalidate the cached header are the source and destination addresses.

When a local device changes its L2 address, all of the cached headers associated with the address become out of date. When the neighboring subsystem is notified about this event (NETDEV_CHANGEADDR, described in the section "Updates via neigh_changeaddr (netdevice notification chain)"), it flushes all of the neighbour entries associated with the device, thereby also invalidating all of the associated cached L2 headers.

When the system detects that the L2 address of a neighbor has changed, it invokes neigh_update_hhs. This function updates all of the cached headers used by that neighbour structure by invoking, in turn, the header_cache_update function provided by the device driver and introduced in the section "Methods Provided by the Device Driver." See the section "Updating a Neighbor's Information: neigh_update."
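
In outline (with invented toy types), the update amounts to walking the list of cached headers hanging off the neighbour entry and rewriting each one through the driver's update callback:

```c
#include <string.h>

/* Rough model of the work neigh_update_hhs has to do when a
 * neighbor's L2 address changes. All types here are simplified
 * stand-ins, not kernel structures. */
struct toy_hh {
    unsigned char dest[6];   /* destination MAC inside the header */
    struct toy_hh *next;
};

/* Stand-in for the driver's header_cache_update method. */
static void eth_update(struct toy_hh *hh, const unsigned char *new_mac)
{
    memcpy(hh->dest, new_mac, 6);
}

static void update_all_hhs(struct toy_hh *list, const unsigned char *mac)
{
    for (struct toy_hh *hh = list; hh; hh = hh->next)
        eth_update(hh, mac);
}
```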

Protocol Initialization and Cleanup

Each neighboring protocol has an initialization function that is executed at boot time if the protocol is included in the kernel, or at module load time if the protocol has been compiled as a module. As for other kernel subsystems, the initialization function allocates all of the resources that are needed by the subsystem to function properly. The four initialization functions of the four neighboring protocols implemented in the Linux kernel are listed in Table 27-1.

Table 27-1. Neighboring protocol init/cleanup functions

Protocol                  Init function    Cleanup function    File

ARP                       arp_init [*]     None                net/ipv4/arp.c

Neighbor Discovery (ND)   ndisc_init       ndisc_cleanup       net/ipv6/ndisc.c

DECnet                    dn_neigh_init    dn_neigh_cleanup    net/decnet/dn_neigh.c

ARP over IP (clip) [*]    atm_clip_init    atm_clip_exit       net/atm/clip.c

[*] arp_init is described in the section "ARP Protocol Initialization" in Chapter 28.

[*] clip represents a special case under ARP, not an independent protocol. Therefore, unlike the other three protocols, clip does not register with neigh_table_init, but accomplishes its initialization (such as memory pool allocation) by itself. Basically, it initializes its neigh_table structure and lets the ARP protocol (arp_bind_neighbour) take care of it.

Here are some of the common tasks accomplished by these functions:

  • Initialize the neigh_table structure with neigh_table_init.

  • Register a group of variables in the /proc filesystem if needed (usually to allow tuning by an administrator).

  • Register a protocol handler. IPv4 registers arp_rcv to use ARP (see Chapter 13). Neighbor handling in IPv6 is part of the more general-purpose protocol, ICMPv6, so IPv6 registers an ICMPv6 protocol handler, which invokes IPv6's counterpart of arp_rcv (ndisc_rcv) for those ICMPv6 messages that have to do with its neighboring protocol, ND.

neigh_table_init accomplishes the following:

  • Allocates a memory pool to reserve memory for neighbour structures.

  • Allocates a neigh_statistics structure that collects statistics about the protocol. See the section "neigh_statistics Structure" in Chapter 29.

  • Allocates the two hash tables hash_buckets and phash_buckets, used respectively as the cache for resolved associations and as a database of proxied addresses. See Figure 27-2.

  • Creates a file under /proc/net that can be used to dump the contents of the cache. The name of the file is taken from neigh_table->id.

  • Starts the gc_timer garbage collector timer. See the section "Garbage Collection."

  • Initializes (but does not start yet) the proxy_timer proxy timer and the associated queue, proxy_queue. See the section "Delayed processing of solicitation requests."

  • neigh_table结构添加到neigh_tables全局列表。后者受到锁的保护,如图27-2所示。

  • Adds neigh_table structures to the neigh_tables global list. The latter is protected by a lock, as shown in Figure 27-2.

  • Initializes a few other parameters, such as reachable_time.

When a protocol is run through a module and the module is unloaded, neigh_table_clear is called to undo what neigh_table_init did at initialization time and to clear any other resources allocated by the protocol during its lifetime, such as timers and queues.

Table 27-1 shows the protocol cleanup functions that use neigh_table_clear to clean up protocol resources. IPv4 is the only one that cannot be compiled as a module, so ARP does not have a cleanup function.

Interaction with Other Subsystems

The neighboring subsystem interacts with other subsystems, both by generating and by receiving notifications when specific events take place. Here are some of the other subsystems involved in these interactions:

Routing

The relationship between caching at this layer and header caching at the neighbor layer is described in the section "L2 Header Caching."

Traffic equalizer (TEQL)

TEQL is one of Traffic Control's queuing disciplines that can be configured through the IPROUTE2 package's tc command. This feature groups a set of links at the L3 layer and uses them in round-robin fashion when transmitting packets to a given destination. The impact on the neighboring protocol is that the resolution of a single IP address (the master address) may actually trigger the resolution of multiple slave IP addresses.

Because each link in the group has to resolve the L3-L2 address binding, the first round over the devices in the group will need that binding to be resolved when moving from one slave to another.

IPsec

IPsec defines a series of transformations that need to be applied to a packet before it can be transmitted—notably encryption. Because of this, if the effects of IPsec were added to Figure 27-1, it would show multiple dst_entry structures in a linked list, and only the last one would have a pointer to a neighbour structure (see Figure 33-5 in Chapter 33).

Netfilter (iptables)

Netfilter hooks are placed at various points affecting the ingress, egress, and forwarding of packets; as these potentially affect all traffic, they affect solicitation requests and responses on the neighboring layer, too. The interaction between Netfilter and the neighboring protocols is taken care of independently from the neighboring infrastructure, partly because different neighboring protocols sit at different layers of the network stack.

Figure 28-13 in Chapter 28 shows how Netfilter and ARP interact by means of the three dedicated hook points NF_ARP_IN, NF_ARP_OUT, and NF_ARP_FORWARD. Unlike ARP, ND sits on top of its L3 protocol, IPv6, so it can be firewalled with the default NF_IP6_PRE_ROUTING, NF_IP6_POST_ROUTING, NF_IP6_LOCAL_IN, and NF_IP6_LOCAL_OUT hooks used for IPv6 traffic. To get an idea where those IPv6 hook points are positioned inside the IPv6 stack, take as a reference the IPv4 counterpart depicted in Figure 18-1 in Chapter 18.

In the following subsections, we'll see some of these interactions from the point of view of the neighboring subsystem.

Events Generated by the Neighboring Layer

When a neighbor is classified as unreachable, and therefore enters the NUD_FAILED state, the neighboring layer executes the neigh_ops->error_report function, which notifies the upper layer about the failure. For example, in the case of an ARP failure, the IPv4 layer would be notified. All of this is taken care of by neigh_timer_handler, the timer handler described in the section "Timers."

Events Received by the Neighboring Layer

As we have seen, entries maintained by the neighboring system become invalid whenever one of their main constituents—L3 address, L2 address, or device—changes. Therefore, the kernel must make sure the neighboring protocols are notified whenever one of these pieces of information changes. This is accomplished through two main functions provided by the neighboring subsystem:

neigh_ifdown

A generic function that external kernel subsystems can invoke to notify the neighboring subsystem about changes to devices and L3 addresses. Notifications about changes to L3 addresses are sent by L3 protocols.

neigh_changeaddr

A function that neighboring protocols can invoke to update a protocol's cache when the L2 address of a local device has changed. Each protocol can register with the kernel to be notified of these events. See the section "Received Events" in Chapter 28 for an example involving ARP. Notifications about changes to L2 addresses are sent by the kernel when a user command changes the hardware address of a device.

Updates via neigh_ifdown

Figure 27-11 summarizes the activities and functions that generate the external events in which the neighboring protocols are interested. Among the main events are:

Device shutdown

Each neighbor entry is associated with a single device. Therefore, if a device is shut down, all of the associated entries have to be removed. To be more exact, the event represents not the shutdown of the device itself, but the clearing of the L3 configuration on the device that results, and that renders the association between the L3 address and L2 address invalid.

The opposite case, of a device being added to the system, is not of interest to the neighboring subsystem.

L3 layer address change

If an administrator changes the configuration of an interface, hosts that were reachable through that interface before might no longer be reachable through it. For that reason, changing an interface's address triggers a call to neigh_ifdown.

Protocol shutdown

If an L3 protocol installed as a module is removed from the kernel, all of the associated neighboring entries become unusable and have to be removed. Figure 27-11 shows two functions that do this kind of cleanup: dn_neigh_cleanup for the removal of DECnet and ndisc_cleanup for the removal of IPv6. IPv4 is not represented because it is not implemented as a module and is never removed.

Figure 27-11. Contexts where neigh_ifdown is called

neigh_ifdown 函数非常简单。它会遍历所有 neighbour 结构,并使与触发事件的设备关联的那些结构变得不可用。(它们不会被立即删除,因为相邻子系统中可能还留有对它们的引用。)以下是 neigh_ifdown 在每个受影响的 neighbour 结构上执行的主要活动:

The function neigh_ifdown is pretty simple. It browses all the neighbour structures and makes the ones associated with the device that has triggered the event unusable. (They are not removed right away because references to them may be left in the neighboring subsystem.) Here are the main activities neigh_ifdown performs on each affected neighbour structure:

  • 停止所有待处理的计时器。

  • Stops all pending timers.

  • 将条目的状态更改为 NUD_NOARP,以便尝试使用该条目的任何流量都不会触发请求。

  • Changes the entry's state to NUD_NOARP so that any traffic that tries to use that entry does not trigger a solicitation request.

  • 将 neigh->output 设置为 neigh_blackhole,以便发送给该邻居的数据包被丢弃而不是被传送。请参阅“用于 neigh->output 的例程”一节中对该函数的描述。

  • Sets neigh->output to neigh_blackhole so that packets sent to the neighbor are dropped rather than delivered. See the description of this function in the section "Routines used for neigh->output."

  • 调用 skb_queue_purge 以丢弃 arp_queue 队列中所有待处理的数据包。neigh_ifdown 从缓存中清除与肇事设备关联的条目后,会调用 pneigh_ifdown 对代理缓存执行相同的操作,并清空代理的 proxy_queue 队列。

  • Invokes skb_queue_purge to drop all pending packets in the arp_queue queue. After neigh_ifdown clears the entries associated with the guilty device from the cache, the function calls pneigh_ifdown to do the same for the proxy cache, and the proxy's proxy_queue queue is purged.
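上述几步可以用下面这个极简的用户态模型来示意。注意:这只是若干假设下的草图,结构体 toy_neighbour 及各函数名均为演示而虚构,并非内核的真实定义。

The steps above can be sketched with the following minimal user-space model. Note that this is only a sketch under assumptions: the toy_neighbour structure and the function names are invented for illustration and are not the kernel's real definitions.

```c
#include <assert.h>
#include <stddef.h>

enum toy_nud_state { TOY_NUD_REACHABLE, TOY_NUD_NOARP };

struct toy_neighbour {
    int dev_id;                             /* 关联的设备 */
    enum toy_nud_state state;
    int timer_pending;                      /* 是否有挂起的计时器 */
    int queued_pkts;                        /* arp_queue 中待处理的数据包数 */
    int (*output)(struct toy_neighbour *);
};

/* 对应 neigh_blackhole:丢弃数据包而不是传送 */
static int toy_blackhole(struct toy_neighbour *n)
{
    (void)n;
    return -1;
}

/* 对应 neigh_ifdown 对每个受影响条目执行的主要活动 */
static void toy_neigh_ifdown(struct toy_neighbour *tbl, size_t n, int dev_id)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].dev_id != dev_id)
            continue;
        tbl[i].timer_pending = 0;       /* 停止所有挂起的计时器 */
        tbl[i].state = TOY_NUD_NOARP;   /* 之后的流量不会触发请求 */
        tbl[i].output = toy_blackhole;  /* 发往该邻居的数据包被丢弃 */
        tbl[i].queued_pkts = 0;         /* 对应 skb_queue_purge 清空 arp_queue */
    }
}
```

与真实实现一样,这里只把条目置为不可用,而不是立即释放它们。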

通过 neigh_changeaddr(网络设备通知链)更新

Updates via neigh_changeaddr (netdevice notification chain)

netdevice 链跟踪许多与网络相关的事件,如第 4 章所列。相邻协议在其初始化例程(arp_init、ndisc_init 等)中向内核注册,以请求来自 netdevice 链的通知。

The netdevice chain keeps track of numerous networking-related events, listed in Chapter 4. Neighboring protocols register with the kernel in their initialization routines (arp_init, ndisc_init, etc.) to ask for notifications from the netdevice chain.
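这种“注册回调、事件发生时逐个通知”的模式可以用下面的用户态草图来示意。带 _toy 后缀的函数名和事件常量都是假设的简化;真实接口是内核的通知链基础设施(register_netdevice_notifier 等)。

This register-then-notify pattern can be sketched in user space as follows. The _toy-suffixed names and the event constant are simplifying assumptions; the real interface is the kernel's notification-chain infrastructure (register_netdevice_notifier and friends).

```c
#include <assert.h>

#define TOY_NETDEV_CHANGEADDR 1
#define TOY_MAX_NOTIFIERS 8

typedef void (*toy_notifier_fn)(int event, int dev_id);

static toy_notifier_fn toy_chain[TOY_MAX_NOTIFIERS];
static int toy_chain_len;

/* 协议(如 ARP 的 arp_init)在初始化时调用,登记自己的回调 */
static int toy_register_netdevice_notifier(toy_notifier_fn fn)
{
    if (toy_chain_len >= TOY_MAX_NOTIFIERS)
        return -1;
    toy_chain[toy_chain_len++] = fn;
    return 0;
}

/* 内核侧(如 do_setlink)在事件发生时逐个调用已登记的回调 */
static void toy_call_netdevice_notifiers(int event, int dev_id)
{
    for (int i = 0; i < toy_chain_len; i++)
        toy_chain[i](event, dev_id);
}

/* 用计数代替真实的“遍历缓存并把条目标记为 dead”(neigh_changeaddr) */
static int toy_dead_marks;

static void toy_arp_netdev_event(int event, int dev_id)
{
    (void)dev_id;
    if (event == TOY_NETDEV_CHANGEADDR)
        toy_dead_marks++;
}
```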

对相邻子系统而言最重要的事件是 NETDEV_CHANGEADDR:当使用如下命令更改设备的 L2 地址时,该事件由 do_setlink 函数生成:

The most important event for the neighboring subsystem is NETDEV_CHANGEADDR, which is generated by the do_setlink function when the L2 address of a device is changed with a command such as:

ip link set eth0 lladdr 01:02:03:04:05:06
ip link set eth0 lladdr 01:02:03:04:05:06

当 neigh_changeaddr 因该更改而被调用时,它会遍历协议缓存中的所有条目,并将与发生更改的设备关联的条目标记为失效。随后垃圾收集过程会处理它们。

When neigh_changeaddr is invoked by the change, it browses all the entries in the protocol cache and marks the ones associated with the changed device as dead. The garbage collection process then takes care of them.

相邻协议和 L3 传输功能之间的交互

Interaction Between Neighboring Protocols and L3 Transmission Functions

我们在第 21 章中看到,IPv4 子系统中的数据包传输以对 ip_finish_output2 的调用结束,该函数将数据包向下传递到 L2 层。在本节中,我们将了解该函数如何与相邻子系统交互。IPv6 子系统中名称和任务与之类似的函数的行为方式相同,只是它调用的是 ND 协议而不是 IPv4 的 ARP 协议。

We saw in Chapter 21 that packet transmission in the IPv4 subsystem ends with a call to ip_finish_output2, which passes the packet down to the L2 layer. In this section, we'll see how this function interacts with the neighboring subsystem. The function that has a similar name and task within the IPv6 subsystem behaves the same way, except that it calls the ND protocol instead of IPv4's ARP protocol.

作为 ip_finish_output2 输入的 skb 缓冲区包含数据包数据(但没有 L2 头),以及诸如用于传输的设备和内核用来做出转发决策的路由表缓存条目(dst)等信息。正如我们在图 27-1 中看到的,该 dst 条目包含一个指向与下一跳(可以是路由器,也可以是最终目的地本身)关联的 neighbour 条目的指针。图 27-12 总结了 ip_finish_output2 所做的、本章中我们感兴趣的决策。

The skb buffer input to ip_finish_output2 includes the packet data (but without an L2 header), along with information such as the device to use for transmission and the routing table cache entry (dst) that was used by the kernel to make the forwarding decision. As we saw in Figure 27-1, that dst entry includes a pointer to the neighbour entry associated with the next hop (which can be either a router or the final destination itself). The decisions made by ip_finish_output2 that are of interest to us in this chapter are summarized in Figure 27-12.

如果缓存的 L2 头可用(hh 不为 NULL),则将其复制到 skb 缓冲区中。(skb->data 指向用户数据的开头,也就是应放置 L2 头的位置。)最后,hh_output 被调用。

If a cached L2 header is available (hh is not NULL), it is copied into the skb buffer. (skb->data points to the start of the user data, which is where the L2 header should be placed.) Finally, hh_output is invoked.

如果没有可用的缓存 L2 头,ip_finish_output2 则调用 neigh->output 方法。正如本章前面所解释的,neigh->output 具体关联哪个函数取决于 neighbour 条目的状态。如果 L2 地址已经就绪,该函数很可能是 neigh_connected,于是可以立即填充报头并发送数据包。否则,neigh->output 很可能被初始化为 neigh_resolve_output,它会将数据包放入 arp_queue 队列,尝试通过发送请求来解析地址,并等待应答到达后再传输数据包。无论数据包是立即发送还是被排队,ip_finish_output2 都返回相同的表示成功的值。此后,数据包就不再由 IP 子系统负责;当请求的应答到达时,相邻子系统会把数据包从 arp_queue 中取出并发送给设备。

If no cached L2 header is available, ip_finish_output2 invokes the neigh->output method. As explained earlier in this chapter, the precise function associated with neigh->output depends on the state of the neighbour entry. If the L2 address is ready, the function will probably be neigh_connected, so the header can be filled in right away and the packet transmitted. Otherwise, neigh->output will probably be initialized to neigh_resolve_output, which will put the packet in the arp_queue queue, try to resolve the address by sending a solicitation request, and wait until the solicitation reply arrives, whereupon the packet is transmitted. Whether the packet is sent immediately or queued, ip_finish_output2 returns the same value, indicating success. The packet is not the IP subsystem's responsibility after this point; when the solicitation reply arrives, the neighboring subsystem dequeues the packet from arp_queue and sends it to the device.
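这一决策逻辑可以用下面的用户态草图示意:有缓存 L2 头就填头并发送,否则交给 neigh->output;两条路径都返回同一个成功值。所有结构体与函数名均为演示而虚构的简化,并非内核真实接口。

This decision logic can be sketched in user space as follows: copy the cached L2 header and send if one is available, otherwise hand the packet to neigh->output; both paths return the same success value. All structures and names here are invented simplifications, not the kernel's real interfaces.

```c
#include <assert.h>
#include <stddef.h>

struct toy_skb { int has_l2_header; int queued; int sent; };

struct toy_dst {
    const char *hh;                          /* 缓存的 L2 头;NULL 表示不可用 */
    int (*neigh_output)(struct toy_skb *);   /* 对应 neigh->output */
};

/* 对应 neigh_resolve_output:数据包进入 arp_queue 等待解析,但仍返回成功 */
static int toy_neigh_resolve_output(struct toy_skb *skb)
{
    skb->queued = 1;
    return 0;
}

/* 对应 L2 地址已就绪的情形(neigh_connected):填头后立即发送 */
static int toy_neigh_connected_output(struct toy_skb *skb)
{
    skb->has_l2_header = 1;
    skb->sent = 1;
    return 0;
}

/* 对应 ip_finish_output2 的决策 */
static int toy_ip_finish_output2(struct toy_dst *dst, struct toy_skb *skb)
{
    if (dst->hh != NULL) {
        skb->has_l2_header = 1;   /* 将缓存的 L2 头复制到 skb->data 之前 */
        skb->sent = 1;            /* 对应调用 hh_output */
        return 0;
    }
    return dst->neigh_output(skb); /* 否则交给 neigh->output */
}
```

注意排队与发送两条路径对调用者不可区分,这正是“数据包此后不再由 IP 子系统负责”的含义。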

ip_finish_output2 函数:精简版

图 27-12。ip_finish_output2 函数:精简版

Figure 27-12. ip_finish_output2 function: compact version

如前面“邻居删除”和“通过 neigh_ifdown 更新”部分中所述,如果必要的 neighbour 条目(即用于传输数据包的路由表缓存元素所引用的 dst->neighbour)在 ip_finish_output2 被调用时已不复存在,则数据包将被丢弃。这种情况理应不可能发生,但万一发生,代码也已准备好处理该异常。

As described earlier in the sections "Neighbor Deletion" and "Updates via neigh_ifdown," if the necessary neighbour entry (the dst->neighbour entry associated with the routing table cache element used to transmit the packet) ceases to exist when ip_finish_output2 is invoked, the packet is dropped. This condition is supposed to be impossible, but the code is ready to handle this exception if it takes place.

图 27-13(a) 和 27-13(b) 提供了图 27-12 的更详细版本,显示了随分配给 neigh->output 的函数以及 neighbour 条目状态的不同而发生的情况。如果查看源代码,你会发现流程图中表示 neigh_resolve_output 的部分主要由 neigh_event_send 展开而成,虚线框标记的部分除外。

Figures 27-13(a) and 27-13(b) offer a more detailed version of Figure 27-12 that shows what happens depending on which function is assigned to neigh->output and on the state of the neighbour entry. If you look at the source code, you can see that the part of the flowchart that represents neigh_resolve_output consists mostly of the expansion of neigh_event_send, with the exception of the part that is marked with the dotted box.

当新的 neighbour 条目处于 NUD_NONE 状态时,其状态将更改为 NUD_INCOMPLETE 并启动其计时器。计时器被初始化为立即到期。计时器处理程序 neigh_timer_handler 随后生成一个请求来解析该地址。

When a new neighbour entry is in the NUD_NONE state, its state is changed to NUD_INCOMPLETE and its timer is fired. The timer is initialized to expire right away. neigh_timer_handler, the timer handler, then generates a solicitation request to resolve the address.

dev_queue_xmit 已在第 21 章介绍过。如第 18 章的图 18-1 所示,dev_queue_xmit 是相邻子系统与流量控制子系统之间的接口,后者位于相邻协议和设备驱动程序之间。

dev_queue_xmit was introduced in Chapter 21. As shown in Figure 18-1 in Chapter 18, dev_queue_xmit is the interface between the neighboring subsystem and the Traffic Control subsystem, which stands between the neighboring protocol and the device driver.

ip_finish_output2 函数:扩展版本

图 27-13a。ip_finish_output2 函数:扩展版本

Figure 27-13a. ip_finish_output2 function: expanded version

ip_finish_output2 函数:扩展版本

图 27-13b。ip_finish_output2 函数:扩展版本

Figure 27-13b. ip_finish_output2 function: expanded version

排队

Queuing

传送到相邻协议处理程序的入口数据包(请求及其应答)通常会立即得到处理。然而,正如“请求的延迟处理”部分中所述以及如图 27-9 所示,可以将代理配置为对它们进行排队和延迟处理。

Ingress packets—solicitations and replies to solicitations—delivered to the neighboring protocol handlers are normally processed right away. However, as described in the section "Delayed Processing of Solicitation Requests" and as shown in Figure 27-9, proxying can be configured to queue and delay them.

由 L3 层传输的数据包,如果它们被寻址到未解析的 L2 地址,则可以由相邻层临时排队以等待地址解析,如“相邻协议和 L3 传输功能之间的交互”部分中所述。(相比之下,相邻协议本身生成的请求和答复会立即传输。)

Packets transmitted by the L3 layer, if they are addressed to unresolved L2 addresses, can be temporarily queued by the neighboring layer to await address resolution, as described in the section "Interaction Between Neighboring Protocols and L3 Transmission Functions." (In contrast, the solicitations and replies generated by the neighboring protocols themselves are transmitted right away.)

以下小节将更详细地介绍入口和出口排队。

The following subsections go into more detail on both ingress and egress queuing .

入口排队

Ingress Queuing

当入口数据包排队时,所有相邻协议都共享某些任务。这些任务包括将数据包添加到缓存、在收到请求的应答时刷新 arp_queue,以及使用代理的 proxy_queue 队列。此外还有特定于单个相邻协议的任务。第 28 章详细介绍了 ARP 需要执行的操作。

All neighboring protocols share certain tasks when ingress packets are queued. These include adding packets to the cache, flushing arp_queue when a solicitation reply is received, and using the proxy's proxy_queue. There are also tasks specific to an individual neighboring protocol. Details on what ARP needs to do are described in Chapter 28.

出口排队

Egress Queuing

当传输数据包时,如果目的地 L3 地址与 L2 地址之间的关联尚未解析,则相邻协议会将数据包暂时插入 arp_queue 队列。(每个相邻协议都有一个名为 arp_queue 的队列,而不仅仅是 ARP 协议。)如果关联被及时解析,则数据包将出队并传输;否则,它会被丢弃。图 27-13 显示了当 L2 地址尚未就绪时,IPv4 数据包如何被排入 ARP 的 arp_queue。

When transmitting a data packet, if the association between the destination layer L3 and L2 address has not been resolved yet, the neighboring protocol inserts the packet temporarily into the arp_queue queue. (Each neighboring protocol has a queue named arp_queue, not just the ARP protocol.) If the association is resolved in a timely manner, the packet is dequeued and transmitted; otherwise, it is dropped. Figure 27-13 shows how IPv4 packets are queued into ARP's arp_queue when the L2 address is not ready.

每个 neighbour 条目都有自己小而私有的 arp_queue;默认情况下它包含三个元素,但可以通过 /proc 在每个设备的基础上进行配置(请参阅第 29 章)。将这些队列设为私有而非由所有邻居共享,可以在协议收到对给定请求的应答时更快地搜索它们,并且还可以确保更好的公平性。如果向私有队列添加新元素时已没有剩余空间,新元素会简单地替换旧元素(参见 __neigh_event_send)。

Each neighbour entry has its own small, private arp_queue; by default it contains three elements, but it can be configured on a per-device basis via /proc (see Chapter 29). Having these queues private, rather than shared by all neighbors, makes searching them faster when the protocol receives replies to a given solicitation, and also assures a better level of fairness. If there is no space left when new elements are added to a private queue, new elements simply replace older ones (see _ _neigh_event_send).
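“队满时新元素替换最老的元素”这一行为可以用下面的小环形队列来示意。容量、结构体与函数名均为演示而设的假设(真实代码在 __neigh_event_send 中操作 sk_buff 队列)。

The replace-the-oldest-on-overflow behavior can be sketched with the small ring queue below. The capacity, structure, and names are assumptions for illustration (the real code manipulates an sk_buff queue in __neigh_event_send).

```c
#include <assert.h>

/* 每邻居私有 arp_queue 的示意模型:容量固定为 3(对应默认值) */
#define TOY_ARP_QUEUE_LEN 3

struct toy_arp_queue {
    int pkt[TOY_ARP_QUEUE_LEN];
    int head;    /* 最老元素的下标 */
    int count;
};

static void toy_arp_enqueue(struct toy_arp_queue *q, int pkt_id)
{
    if (q->count == TOY_ARP_QUEUE_LEN) {
        q->pkt[q->head] = pkt_id;                   /* 覆盖最老的数据包 */
        q->head = (q->head + 1) % TOY_ARP_QUEUE_LEN;
    } else {
        q->pkt[(q->head + q->count) % TOY_ARP_QUEUE_LEN] = pkt_id;
        q->count++;
    }
}

static int toy_arp_oldest(const struct toy_arp_queue *q)
{
    return q->count ? q->pkt[q->head] : -1;
}
```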

三种常见情况下处理的数据包

图 27-14。三种常见情况下处理的数据包

Figure 27-14. Packets handled in three common situations

图 27-14显示了三种常见情况,为了简单起见忽略了代理:

Figure 27-14 shows three common cases, ignoring proxies for the sake of simplicity:

(a) 清空缓存
(a) Empty cache

步骤如下。

  1. L3 层提交请求,将数据包传输到 L3 目标地址 192.168.1.1。

  2. 查询缓存,产生缓存未命中。

  3. 数据包被暂时插入队列中。

  4. 发出请求。

  5. 请求的应答到达。

  6. 缓存已填充。

  7. 队列中等待的数据包被发送出去。

The steps are as follows.

  1. The L3 layer submits a request to transmit a packet to the L3 destination address 192.168.1.1.

  2. The cache is queried, generating a cache miss.

  3. The packet is temporarily inserted into the queue.

  4. A solicitation request is sent.

  5. The solicitation reply arrives.

  6. The cache is populated.

  7. The packet waiting in the queue is sent out.

(b) 地址解析尚未完成
(b) Address resolution pending

步骤如下。

  1. L3 层提交请求,将数据包传输到 L3 目标地址 192.168.1.1。

  2. 缓存被查询。

  3. 该地址不在缓存中,但内核已经开始解析该地址的任务,因此该数据包被暂时插入队列以等待对挂起请求的答复。

当另一个数据包在情况 (a) 的步骤 (5) 处等待时,可能会发生这种情况。

The steps are as follows.

  1. The L3 layer submits a request to transmit a packet to the L3 destination address 192.168.1.1.

  2. The cache is queried.

  3. The address is not in the cache, but the kernel has already started the task of resolving the address, so the packet is temporarily inserted into the queue to wait for the reply to the pending request.

This case can occur when another packet is waiting at step (5) of case (a).

(c) 地址已解析
(c) Address already resolved

步骤如下。

  1. L3 层提交请求,将数据包传输到 L3 目标地址 192.168.1.1。

  2. 缓存被查询。

  3. 因为缓存返回命中,所以数据包可以立即发送出去。

The steps are as follows.

  1. The L3 layer submits a request to transmit a packet to the L3 destination address 192.168.1.1.

  2. The cache is queried.

  3. Because the cache returns a hit, the packet can be sent out right away.
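这三种情形可以归纳成下面这个小状态机。枚举值与函数名均为演示而虚构;真实实现中“排队”和“发出请求”分别对应 arp_queue 与 solicit 方法。

The three cases can be condensed into the little state machine below. The enum values and names are invented for illustration; in the real implementation, "queue" and "solicit" correspond to arp_queue and the solicit method.

```c
#include <assert.h>

enum toy_tx_result { TOY_SENT, TOY_QUEUED_AND_SOLICITED, TOY_QUEUED_ONLY };

struct toy_cache_entry {
    int resolved;     /* L2 地址已知 */
    int resolving;    /* 请求已发出,应答未到 */
};

static enum toy_tx_result toy_transmit(struct toy_cache_entry *e)
{
    if (e->resolved)
        return TOY_SENT;                  /* 情形 (c):缓存命中,立即发送 */
    if (e->resolving)
        return TOY_QUEUED_ONLY;           /* 情形 (b):等待挂起请求的应答 */
    e->resolving = 1;                     /* 情形 (a):排队并发出请求 */
    return TOY_QUEUED_AND_SOLICITED;
}

/* 情形 (a) 的步骤 5-7:应答到达,填充缓存,队列中的数据包随后被发送 */
static void toy_reply_arrives(struct toy_cache_entry *e)
{
    e->resolving = 0;
    e->resolved = 1;
}
```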




[*] 对于 neigh->output 的第一次初始化,请查看构造函数例程的源代码(例如,ARP/ND 对应的 arp_constructor/ndisc_constructor)。对于 ARP,请参见第 28 章“邻居结构的初始化”一节。

[*] For the first initialization of neigh->output, check the source code of the constructor routines (e.g., arp_constructor/ndisc_constructor for ARP/ND). For ARP, see the section "Initialization of a neighbour Structure" in Chapter 28.

[*] 用于比较时间戳的例程,例如 time_after_eq 和 time_before_eq,在 include/linux/jiffies.h 中定义。

[*] The routines used to compare timestamps, such as time_after_eq and time_before_eq, are defined in include/linux/jiffies.h.

[] neigh_event_send 的一部分也在图 27-13 中作为展开后的 neigh_resolve_output 流程图的一部分加以描绘。

[] Part of neigh_event_send is also depicted in Figure 27-13 as part of the expanded neigh_resolve_output flowchart.

[*] 请参阅第 28 章中的“ARPD”部分和第 29 章中的“neigh_parms 结构”部分。

[*] See the section "ARPD" in Chapter 28, and the section "neigh_parms Structure" in Chapter 29.

[*] 更准确地说,它是由 neigh_rand_reach_time 例程计算的、范围在 base_reachable_time/2 到 (3×base_reachable_time)/2 之间的随机值。

[*] To be more exact, it is a random value in the range base_reachable_time/2 to (3×base_reachable_time)/2, as computed by the neigh_rand_reach_time routine.

[*] IPv4 的 rt_intern_hash(第 33 章中描述)和 IPv6 的 ip6_route_add 最终都会调用 __neigh_lookup_errno。

[*] Both IPv4's rt_intern_hash (described in Chapter 33) and IPv6's ip6_route_add end up calling _ _neigh_lookup_errno.

[ * ]某些设备驱动程序允许管理员临时更改 MAC 地址(即,在重新启动电源后返回到其原始值)或永久更改 MAC 地址。该操作仅限于特殊场景,一般用户不需要。

[*] Some device drivers let the administrator change the MAC address either temporarily (i.e., it returns to its original value after a power cycle) or permanently. This operation is limited to special scenarios and is not needed by the average user.

[ * ]随机延迟是 RFC 2461 中涵盖的主题之一。该文档涉及 IPv6/ND,但 Linux 对 IPv4/ARP 也做了同样的事情。

[*] The random delay is one of the topics covered in RFC 2461. That document deals with IPv6/ND, but Linux does the same for IPv4/ARP.

[ * ]以太网标头不包括前导码和校验和,因为它们由 NIC 本身处理。

[*] The Ethernet header does not include the preamble and the checksum, because they are taken care of by the NIC itself.

第 28 章相邻子系统:地址解析协议 (ARP)

Chapter 28. Neighboring Subsystem: Address Resolution Protocol (ARP)

第 27 章描述了所有相邻协议共有的基础设施所提供的服务。本章将展示 IPv4 使用的协议 ARP 如何融入该基础设施的模块化设计。熟悉 ARP 的读者可能已经从前面章节对通用邻居子系统的描述中看出了其行为的轮廓,尽管用于描述该子系统的术语更多地来自 IPv6 的 ND 协议而不是 ARP。

Chapter 27 described the services provided by the infrastructure common to all neighboring protocols. This chapter will show how ARP, the protocol used by IPv4, fits into the modular design of the infrastructure. Readers familiar with ARP may have seen the outlines of its behavior in the description of the general neighboring subsystem in the previous chapters, although the nomenclature used to describe the subsystem is drawn more from IPv6's ND protocol than from ARP.

通用基础设施的存在使得 ARP 的设计和实现更加简单。为了在本章中介绍 ARP,我们关注以下几点:

The presence of a common infrastructure makes the design and implementation of ARP simpler. To cover ARP in this chapter, we look at the following points:

  • 如何初始化 neigh_table 结构 arp_tbl,以便为 ARP 调整公共邻居基础设施的行为

  • How the neigh_table structure arp_tbl is initialized to tune the behavior of the common neighboring infrastructure for ARP

  • 如何初始化 neigh_parms 结构,以便为 ARP 调整公共相邻基础设施的行为(例如,设置计时器到期时间)

  • How the neigh_parms structure is initialized to tune the behavior of the common neighboring infrastructure for ARP (e.g., to set timer expiration periods)

  • ARP 数据包(即 ARPOP_REQUEST/ARPOP_REPLY)的接收如何与相邻子系统交互,以及 solicit 方法如何工作

  • How the reception of ARP packets (i.e., ARPOP_REQUEST/ARPOP_REPLY) interacts with the neighboring subsystem, and how the solicit method works

  • neigh_ops 结构的初始化方式如何取决于设备类型和 L3 地址类型(单播、多播或广播)

  • How the neigh_ops structure is initialized depending on the device type and the type of L3 address (unicast, multicast, or broadcast)

  • 代理 ARP 如何使用通用基础设施

  • How proxy ARP uses the common infrastructure

  • 如何通过编译选项和特殊功能的显式配置来进一步定制 ARP 的行为

  • How the behavior of ARP can be further tailored by means of compile options and the explicit configuration of special features

  • 内核如何将一些工作移交给用户空间守护进程arpd来处理特别繁重的工作负载

  • How the kernel can hand some work over to a user-space daemon, arpd, to handle a particularly heavy workload

  • ARP与反向ARP(RARP)的关系

  • The relationship between ARP and Reverse ARP (RARP)

  • ARP 可以向其他内核子系统通知哪些事件,反之亦然

  • What events ARP can notify to other kernel subsystems, and vice versa

本章最后简要概述了 IPv6 的 ND over ARP 所做的改进。

The chapter concludes with a brief overview of the improvements made by IPv6's ND over ARP.

ARP 数据包格式

ARP Packet Format

图28-1所示为封装在以太网帧中的ARP报文。

Figure 28-1 shows an ARP packet encapsulated in an Ethernet frame.

ARP 数据包封装在以太网帧中

图 28-1。ARP 数据包封装在以太网帧中

Figure 28-1. ARP packet encapsulated in an Ethernet frame

下面是对 ARP 数据包各字段的逐一描述,在 Linux 中它们用 arphdr 结构表示:[*]

Here is a field-by-field description of the ARP packet's fields, represented in Linux with an arphdr structure:[*]

硬件类型
Hardware type

硬件类型标识符(例如以太网)。请参阅 include/linux/if_arp.h 中的 ARPHRD_XXX 值。

Hardware type identifier (e.g., Ethernet). See the ARPHRD_XXX values in include/linux/if_arp.h.

协议类型
Protocol type

L3 协议标识符(例如 IPv4)。请参阅 include/linux/if_ether.h 中的 ETH_P_XXX 值。

L3 protocol identifier (e.g., IPv4). See the ETH_P_ XXX values in include/linux/if_ether.h.

硬件地址长度
Hardware size

L2 地址的大小(以八位字节为单位)(例如,以太网为 6)。

Size in octets of an L2 address (e.g., 6 for Ethernet).

协议地址长度
Protocol size

L3 地址的大小(以八位字节为单位)(例如,IPv4 为 4)。

Size in octets of an L3 address (e.g., 4 for IPv4).

Oper(操作码)
Oper

操作码,在此列表之后描述。

Operation code, described following this list.

SHA、SPA(发送者硬件地址、发送者协议地址)
SHA, SPA (Sender Hardware Address, Sender Protocol Address)

发送方的硬件和协议地址。

Hardware and protocol addresses of the sender.

THA、TPA(目标硬件地址、目标协议地址)
THA, TPA (Target Hardware Address, Target Protocol Address)

“目标”(或接收者)的硬件地址和协议地址。请求的发送者通常将 THA 设置为 0,因为该地址正是发送者试图通过请求发现的地址。但有时发送者会通过发送包含当前已知 THA 的请求,来确认现有的 neighbour 条目。

Hardware and the protocol addresses of the "target" (or receiver). The sender of a solicitation request normally sets THA to 0, because this address is just what the sender is trying to discover with the solicitation. But sometimes the sender tries to confirm an existing neighbour entry by sending a request containing the current, known THA.

由于 ARP 通常仅用于 IPv4,因此内核代码使用缩写 SIP 和 TIP(源 IP 地址和目标 IP 地址)来指代 SPA 和 TPA。

Because ARP is normally used only for IPv4, kernel code uses the abbreviations SIP and TIP (Source IP address and Target IP address) to refer to SPA and TPA.
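作为参考,下面在用户态按在线格式重建了 ARP 头的固定部分,字段与内核 include/linux/if_arp.h 中 struct arphdr 的前五个字段一一对应(结构体改名以示区分)。对于以太网上的 IPv4,固定部分之后依次是 SHA(6)、SPA(4)、THA(6)、TPA(4),整个报文共 28 字节。

For reference, here is a user-space reconstruction of the fixed part of the ARP header, with fields matching the first five fields of struct arphdr in the kernel's include/linux/if_arp.h (renamed to mark it as a copy). For IPv4 over Ethernet, the fixed part is followed by SHA(6), SPA(4), THA(6), TPA(4), for a 28-byte packet.

```c
#include <assert.h>
#include <stdint.h>

/* ARP 头固定部分(8 字节),对应内核 struct arphdr 的布局 */
struct toy_arphdr {
    uint16_t ar_hrd;   /* 硬件类型,如 1 = 以太网 */
    uint16_t ar_pro;   /* 协议类型,如 0x0800 = IPv4 */
    uint8_t  ar_hln;   /* 硬件地址长度,以太网为 6 */
    uint8_t  ar_pln;   /* 协议地址长度,IPv4 为 4 */
    uint16_t ar_op;    /* 操作码:1 = ARPOP_REQUEST,2 = ARPOP_REPLY */
} __attribute__((packed));

/* 给定地址长度,计算完整 ARP 报文的大小:固定部分 + 2×(hln + pln) */
static unsigned int toy_arp_packet_len(uint8_t hln, uint8_t pln)
{
    return (unsigned int)sizeof(struct toy_arphdr) + 2u * (hln + pln);
}
```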

include/linux/if_arp.h 中描述的 ARPOP_XXX 操作码列表提供了大量 ARP 消息类型。其中两个由 RARP 使用,稍后在“反向地址解析协议 (RARP)”部分中描述。还有几个由一个名为 InARP 的相对较新的协议使用,它是 ARP 的扩展(在 RFC 2390 中定义),由帧中继和 ATM 使用,超出了本书的范围。这里我们将介绍 ARP 最基本的两个:

A large number of ARP message types are offered in the ARPOP_ XXX list of opcodes described in include/linux/if_arp.h. Two are used by RARP and are described later, in the section "Reverse Address Resolution Protocol (RARP)". Several are used by a relatively recent protocol named InARP, which is an extension to ARP (defined in RFC 2390) that is used by Frame Relay and ATM and is beyond the scope of this book. Here we will cover the two that are basic to ARP:

ARPOP_REQUEST
ARPOP_REQUEST

这用于发送请求,以尝试将 L3 地址解析为 L2 地址。对于新的邻居条目,主机将消息发送到与设备硬件关联的广播地址。为了确认现有的 neighbour 条目,主机将消息直接发送到邻居的 L2 地址。请求相当于 IPv6 所说的邻居请求(neighbor solicitation)。

请求还可以用于其他原因,例如“免费 ARP ”部分中讨论的原因。

This is used to send a solicitation in an attempt to resolve an L3 address to an L2 address. For a new neighbor entry, a host sends the message to the broadcast address associated with the device's hardware. To confirm an existing neighbour entry, the host sends the message directly to the neighbor's L2 address. A request is equivalent to what IPv6 calls neighbor solicitation.

Solicitations also can be used for other reasons, such as those discussed in the section "Gratuitous ARP."

ARPOP_REPLY
ARPOP_REPLY

这是响应 ARPOP_REQUEST 时发送的消息。通常它会直接发送到发送请求的主机。但有时也可以发送到广播地址;主机可以在更改其配置后这样做,以更新其邻居的缓存。广播应答相当于 IPv6 所说的邻居通告(neighbor advertisement)。

This is the message sent in answer to an ARPOP_REQUEST. Normally it is sent directly to the host that sent the request. But sometimes it can be sent to the broadcast address; a host can do this to update the caches of its neighbors after it changes its configuration. A broadcast reply is equivalent to what IPv6 calls neighbor advertisement.

ARP 数据包的目标地址类型

Destination Address Types for ARP Packets

L3 地址的地址类型可以是单播、广播或多播。类型保存在 neighbour 结构中(即 neigh->type 字段),如“邻居结构的初始化”一节中所述,并且可以通过调用例程 inet_addr_type 来确定。ARP 处理每种类型的方式如下:

The address type of an L3 address can be unicast, broadcast, or multicast. The type is saved in the neighbour structure (as the neigh->type field), as explained in the section "Initialization of a neighbour Structure," and can be determined by invoking the routine inet_addr_type. Each type is handled by ARP as follows:

单播
Unicast

这是最常见的情况,可以通过ARP的正常请求方法来解决。

This is the most common case, and is resolved by ARP's normal solicitation method.

广播
Broadcast

ARP 只是将 L3 广播地址直接映射到与设备关联的 L2 广播地址。

ARP simply maps the L3 broadcast address directly to the L2 broadcast address associated with the device.

组播
Multicast

ARP 使用例程 arp_mc_map 从 L3 多播地址导出 L2 多播地址。ARP 不需要生成请求,因为 L2 地址可以通过一个取决于硬件类型(以太网、令牌环等)的公式得出。请参阅第 26 章中的“特殊情况”一节。

ARP uses the routine arp_mc_map to derive the L2 multicast address from the L3 multicast address. ARP does not need to generate a solicitation request, because the L2 address can be derived by a formula that depends on the hardware type (Ethernet, Token Ring, etc.). See the section "Special Cases" in Chapter 26.
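以太网情形下这一映射可以直接演算:组播 MAC 的前缀固定为 01:00:5e 后跟一个 0 位,其余 23 位取自 IPv4 组播地址的低 23 位(RFC 1112)。下面是一个用户态草图;内核的 arp_mc_map 会按设备类型分派到不同公式,这里只示意以太网,函数名为演示而虚构。

For Ethernet this mapping can be computed directly: the multicast MAC starts with the fixed prefix 01:00:5e followed by a 0 bit, and the remaining 23 bits come from the low 23 bits of the IPv4 multicast address (RFC 1112). Below is a user-space sketch; the kernel's arp_mc_map dispatches by device type, and only the Ethernet case is shown here, with an invented function name.

```c
#include <assert.h>
#include <stdint.h>

/* 将 IPv4 组播地址(主机字节序)映射为以太网组播 MAC 地址 */
static void toy_ip_eth_mc_map(uint32_t ip, uint8_t mac[6])
{
    mac[0] = 0x01;
    mac[1] = 0x00;
    mac[2] = 0x5e;
    mac[3] = (uint8_t)((ip >> 16) & 0x7f);  /* 只保留 IP 的低 23 位 */
    mac[4] = (uint8_t)((ip >> 8) & 0xff);
    mac[5] = (uint8_t)(ip & 0xff);
}
```

由于只保留 23 位,多个 L3 组播地址(例如 224.1.2.3 与 225.1.2.3)会映射到同一个 L2 地址;映射是纯计算的,这正是不需要请求的原因。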

ARP 事务示例

Example of an ARP Transaction

图 28-2显示了一个简单的情况,其中一台主机请求与 IP 地址 10.0.0.4 关联的 L2 地址,并且该地址的所有者进行答复。请求中的 MAC 目标(要解析的地址)为 0,表示它应该由回复者(可能是目标 IP 地址的所有者)填写。

Figure 28-2 shows a simple case where one host asks for the L2 address associated with the IP address 10.0.0.4, and the owner of that address replies. The MAC target (the address to resolve) in the request is 0, to indicate that it should be filled in by whoever (probably the owner of the target IP address) replies.

如果仔细观察图 28-2,您会发现发送方硬件地址出现两次:一次在以太网标头中,一次在 ARP 有效负载中。它们通常匹配,但并非总是匹配(请参阅第 27 章中的“充当代理”部分)。

If you look carefully at Figure 28-2, you can see that the sender hardware address is present twice: once in the Ethernet header and once in the ARP payload. They usually match, but not always (see the section "Acting As a Proxy" in Chapter 27).

更详细的示例将出现在后面的“示例”部分中。

More-detailed examples appear later in the section "Examples."

免费ARP

Gratuitous ARP

通常发送 ARPOP_REQUEST 是因为发送者想要与给定的 IP 地址通信,需要找出关联的 L2 地址。但有时发送者生成 ARPOP_REQUEST 是为了向接收者通告某些信息,而不是询问信息。这就是所谓的免费 ARP,常用于以下几种情况:

Normally an ARPOP_REQUEST is sent because the sender wants to talk to a given IP address and needs to find out the associated L2 address. But sometimes the sender generates an ARPOP_REQUEST to inform the receivers about some information, instead of asking for information. This is called gratuitous ARP and is commonly used in the following situations:

  • L2地址变更

  • Change of L2 address

  • 重复地址检测

  • Duplicate address detection

  • 虚拟IP

  • Virtual IP

下面的小节中对每一项进行了描述。

Each is described in the subsections that follow.

更改L2地址

Change of L2 Address

我们已经在第 26 章“需要邻居协议的原因”一节中看到,如果没有协议的帮助,L2 地址的更改(这会使网络上其他节点的 neighbour 条目失效)是无法被检测到的。与其等待旧关联过期并强制每个节点启动新的协议事务(并因此遭受暂时的黑洞),不如提前触发关联的更新。更改了地址的节点通过免费 ARP 完成这一更新。有关示例,请参阅 net/irda/irlan/irlan_eth.c。

We already saw in the section "Reasons That Neighboring Protocols Are Needed" in Chapter 26, that a change of L2 address (which invalidates neighbour entries for other nodes on the network) cannot be detected without the help of a protocol. Instead of waiting for the old association to expire and forcing each node to start a new protocol transaction (and therefore suffer a temporary black hole), it makes sense to trigger the update of the association in advance. The node that changed the address accomplishes the update through gratuitous ARP. See net/irda/irlan/irlan_eth.c for an example.

ARP 使用示例

图 28-2。ARP 使用示例

Figure 28-2. Example of ARP usage

重复地址检测

Duplicate Address Detection

本地网络上的两台主机不应具有相同的 L3 地址,但此问题可能会发生,特别是在混合有静态和动态(即基于 DHCP)配置的大型网络中。重复地址的最常见原因是存在多个具有重叠地址池的 DHCP 服务器以及不正确的手动配置。

No two hosts on a local network should have the same L3 address, but this problem can happen, especially in big networks with a mix of static and dynamic (that is, DHCP-based) configurations. The most common reasons for duplicate addresses are the presence of multiple DHCP servers with overlapping address pools, and incorrect manual configurations.

为了检测是否存在重复地址,主机可以使用免费 ARP。如果您为自己的地址发送 ARP 请求,则只有当另一台主机配置了您的 IP 地址时,您才会收到回复。如果没有重复的地址,则不会收到回复。

To detect the presence of a duplicate address, a host can use gratuitous ARPs. If you send an ARP solicitation for your own address, you will receive a reply only when another host is configured with your IP address. If there is no duplicate address, no replies should be received.

让我们看一个使用图 28-3中的拓扑的示例。当主机 A 启动时,一旦将其eth0接口配置为 IP 地址 10.0.0.4,它就会发送请求,询问谁拥有 IP 地址 10.0.0.4(其自己的 IP 地址)。如果子网中没有主机配置错误,则主机 A 将不会收到回复。但由于主机 Bad_guy 配置了与主机 A 相同的 10.0.0.4 IP 地址,因此它会回复ARPOP_REQUEST,从而通知主机 A 存在重复地址。

Let's see an example using the topology in Figure 28-3. When Host A boots up, as soon as it configures its eth0 interface with IP address 10.0.0.4, it sends a request asking who has IP address 10.0.0.4 (its own IP address). If none of the hosts in the subnet was misconfigured, Host A will not receive a reply. But since Host Bad_guy is configured with the same 10.0.0.4 IP address as Host A, it replies to the ARPOP_REQUEST, thus informing Host A of the presence of a duplicate address.

当然,允许主机在大型网络上以随机间隔发送 ARP 数据包会降低性能。相反,如“零地址请求”部分所示,DHCP 服务器通常在向客户端授予地址之前发出请求,这是一种更具可扩展性的解决方案。

Of course, allowing hosts to send out ARP packets at random intervals on large networks is bad for performance. Instead, as shown in the section "Requests with zero addresses," a DHCP server usually issues the request before granting an address to a client, which is a more scalable solution.

当您在本地接口上配置 IP 地址时,Linux 内核不会生成任何免费 ARP。然而,大多数 Linux 发行版都安装了 iputils 软件包,其中包括 arping 命令。arping 可用于生成 ARP_REQUEST 帧。当您使用 /sbin/ifup 命令(initscripts 包的一部分)启用网络接口时,它会使用 arping 来检查重复的地址。

The Linux kernel does not generate any gratuitous ARP when you configure an IP address on the local interfaces. However, most Linux distributions come with the iputils package installed, which includes the arping command. arping can be used to generate ARP_REQUEST frames. When you enable a network interface with the /sbin/ifup command (part of the initscripts package), it uses arping to check for duplicate addresses.

虚拟IP

Virtual IP

免费 ARP 的另一个常见用途是允许服务器池中的故障转移。通常,为了提供冗余,站点会提供一台活动服务器以及许多处于备用模式的类似配置的主机。当活动服务器由于某种原因发生故障时,通常称为 心跳计时器的机制(通过服务器池上的某种协议实现)检测故障并触发新的活动服务器的选举。这台新服务器会生成免费 ARP 来更新网络中所有其他主机的 ARP 缓存。由于新服务器已获取旧服务器的 IP 地址,因此ARPOP_REQUEST 不会应答,但所有收件人都会相应地更新其缓存。

Another common use for gratuitous ARP is to allow failover in a pool of servers. Commonly, to provide redundancy, a site provides one active server along with a number of similarly configured hosts in standby mode. When the active server fails for some reason, a mechanism often referred to as a heartbeat timer (implemented through some protocol on the pool of servers) detects the failure and triggers the election of a new active server. This new server generates a gratuitous ARP to update the ARP cache of all the other hosts in the network. Because the new server has taken the IP address of the old server, the ARPOP_REQUEST is not answered, but all the recipients update their caches accordingly.

请注意,通过这种方式,IP 层和更高层可以保持通信,甚至不会注意到变化。当然,由于心跳是定期发送的,所以在旧服务器故障、新服务器接管后,会有一小段时间窗口,在此期间流量无法送达。因此,一些节点可能会发现故障并将其邻居条目标记为失效,直到新的 ARPOP_REQUEST 到达。

Note that in this way, the IP layer and higher layers can keep communicating without even noticing the change. Of course, because heartbeats are sent out at regular intervals, a small window of time exists after the old server fails and the new one takes over, during which traffic is not delivered. So some nodes may discover the failure and mark their neighbor entries as failed until the new ARPOP_REQUEST arrives.

重复地址检测示例

图 28-3。重复地址检测示例

Figure 28-3. Example of duplicate address detection

图 28-4 [ * ]中的示例显示了两台路由器,一台充当主动角色,另一台充当备用角色 (a)。标记为“活动”的服务器的 IP 地址为 10.0.0.1。LAN2 的主机使用该路由器与 LAN1 的主机通信,反之亦然。

The example in Figure 28-4 [*] shows two routers, one taking the active role and the other taking the standby role (a). The server labeled Active has the IP address 10.0.0.1. The hosts of LAN2 use this router to communicate with the hosts of LAN1, and vice versa.

故障转移系统已经就位,这样当活动路由器出现故障时,备用路由器会接管 IP 地址 10.0.0.1 并成为活动路由器 (b)。当备用路由器成为新的活动路由器时,它会发出免费 ARP 请求,更新所有本地主机的条目 (c),使 10.0.0.1 与新活动路由器的 L2 地址关联。来自 LAN2 的后续 IP 流量即到达此路由器。新的活动路由器也会向 LAN1 发送免费 ARP 请求,但这在图中没有显示。该图也没有显示现实中管理员会配置的另一个细节:每个路由器在其每个接口上都有第二个 IP 地址,主要用于在当前角色不是活动角色时提供连通性。

A failover system is in place so that when the Active router fails, the Standby router takes over the IP address 10.0.0.1 and becomes the Active router (b). When the Standby router becomes the new Active router, it sends out a gratuitous ARP request that changes the entries of all local hosts (c) so that 10.0.0.1 is associated with the L2 address of the new active router. Subsequent IP traffic from LAN2 comes to this router. The new Active router also sends a gratuitous ARP request to LAN1, but this is not shown in the figure. The figure also does not show another detail that a real-life administrator would configure: each router would have a second IP address on each of its interfaces, used mainly to provide connectivity when the current role is not active.

免费 ARP 示例

图 28-4。免费 ARP 示例

Figure 28-4. Example of gratuitous ARP

从多个接口响应

Responding from Multiple Interfaces

Linux 有一个相当不寻常的设计:它认为 IP 地址属于主机而不是接口,尽管管理员总是将 IP 地址分配给特定接口。[ * ]这会产生一些管理员抱怨的影响:

Linux has a rather unusual design: it considers an IP address as belonging to a host rather than an interface, even though administrators always assign IP addresses to particular interfaces.[*] This has impacts that some administrators complain about:

  • Linux 主机会回复指定在其任何接口上配置的目标 IP 地址的任何 ARP 请求请求,即使该请求是由其他接口在此主机上接收的。为了使 Linux 的行为就像地址属于接口一样,管理员可以使用稍后在“ /proc 选项”部分中描述的 ARP_IGNORE 功能。

  • A Linux host replies to any ARP solicitation requests that specify a target IP address configured on any of its interfaces, even if the request was received on this host by a different interface. To make Linux behave as if addresses belong to interfaces, administrators can use the ARP_IGNORE feature described later in the section "/proc Options."

  • 主机可能会遇到 ARP flux 问题,即错误的接口与某个 L3 地址关联。这个问题将在下文中描述。

  • Hosts can experience the ARP flux problem, in which the wrong interface becomes associated with an L3 address. This problem is described in the text that follows.

假设您有一台主机在同一 LAN 上有两个 NIC,并且另一台主机向其中一个地址发送 ARP 请求。两个接口都收到请求,如图28-5所示,并且两个接口都回复。

Imagine you have a host with two NICs on the same LAN, and that another host sends an ARP request for one of the addresses. The request is received by both interfaces, as shown in Figure 28-5, and both interfaces reply.

Figure 28-5. The ARP flux problem

The host sending the solicitation therefore receives two replies to its request. One comes from the NIC with the correct L2 address (eth0), but the other bears the other NIC's address (eth1). Which address the correspondent enters in its ARP cache depends on the order in which the replies happen to be received and on how the host handles duplicate replies—in short, it's nondeterministic.

The ARP flux problem can be solved with the features described in the section "Tunable ARP Options."

Tunable ARP Options

The kernel allows the user to tune the ARP behavior via both the /proc filesystem and compile-time options. We will see details on how to configure those features, their allowed settings, and their defaults in the section "Tuning via /proc Filesystem" in Chapter 29, but let's introduce the main ones here.

Compile-Time Options

Two ARP options can be enabled at compile time:

ARPD (CONFIG_ARPD)

This allows a user-space daemon to handle ARP, which can improve performance on a very large and busy network. See the section "ARPD."

UNSOLICITED ARP (CONFIG_IP_ACCEPT_UNSOLICITED_ARP)

By default, when a host receives an ARPOP_REPLY for which it had no pending ARPOP_REQUEST, the kernel drops the reply. Sometimes, however, it could be useful to accept it. This feature, which establishes that unsolicited replies are accepted, is actually not supported by Linux anymore: the code is commented out (in the function arp_process) and the kernel configuration menu does not provide any way to enable it.

Do not confuse the effect of this feature with gratuitous ARP. ARP_UNSOLICITED accepts unsolicited ARPOP_REPLY packets, whereas gratuitous ARP causes a "push" update via an ARPOP_REQUEST. As Figure 28-18 shows, only unicast unsolicited requests are accepted.

/proc Options

Most of those features can be configured both globally and on a per-device basis. Code can check whether they are enabled by using the IN_DEV_ XXX macros defined in include/linux/inetdevice.h (e.g., IN_DEV_ARP_ANNOUNCE, IN_DEV_ARP_IGNORE, and IN_DEV_ARPFILTER). Please refer to the definition of those macros to see which features are global and which are local. All of the macros take, as their input parameter, the device's IP configuration block (net_device->ip_ptr), which is normally retrieved with the routine in_dev_get.

Some of those options have been introduced to address issues specific to Linux Virtual Servers (LVS). In the LVS HOWTO, in the section "LVS: the ARP problem," you can find detailed information on what these features are for and how they can be configured. You will also find information about previous approaches.

ARP_ANNOUNCE

This option controls which source IP addresses can be put in the ARP headers of solicitation requests, when the host generating the request offers multiple addresses. Table 28-1 lists the allowed levels and tells how the IP address is selected from the ones configured on the local system. The section "Solicitations" shows how ARP uses it. ARP_ANNOUNCE is handled in the arp_solicit function.

Table 28-1. ARP_ANNOUNCE levels

Value

Meaning

0 (Default)

Any local IP address is fine.

1

If possible, pick an address that falls within the same subnet as the target address. If not possible, use level 2.

2

Prefer primary addresses.
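
The level-1 fallback in Table 28-1 can be sketched as a small selection routine. The following is a user-space model of the logic ARP applies in arp_solicit; the structure and names are hypothetical, not kernel code:

```c
#include <stdbool.h>
#include <stdint.h>

/* One locally configured IP address (hypothetical user-space model). */
struct local_addr {
    uint32_t ip;
    bool primary;   /* primary (vs. secondary) address on the device */
};

/* Returns the index of the chosen source address per Table 28-1,
 * or -1 if no address is configured. */
static int arp_announce_pick(int level, const struct local_addr *a, int n,
                             uint32_t target, uint32_t netmask)
{
    if (n == 0)
        return -1;
    if (level == 1) {           /* prefer an address in the target's subnet... */
        for (int i = 0; i < n; i++)
            if ((a[i].ip & netmask) == (target & netmask))
                return i;
        level = 2;              /* ...otherwise fall back to level 2 */
    }
    if (level == 2) {           /* prefer primary addresses */
        for (int i = 0; i < n; i++)
            if (a[i].primary)
                return i;
    }
    return 0;                   /* level 0: any local address is fine */
}
```

Note how level 1 degenerates into level 2 when no same-subnet address exists, exactly as the table describes.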

ARP_IGNORE

This option controls the criteria that determine whether to process ARPOP_REQUEST packets.

Normally, all requests that can be handled by a host are processed. As explained in the section "Responding from Multiple Interfaces," IP addresses in Linux belong to the host, not to its interfaces. Because of that, an ARPOP_REQUEST will be processed by a host as long as the target IP address is configured on any of the interfaces, including the loopback interface.[*] In some cases, such as with LVS, that would be a problem. By configuring ARP_IGNORE properly, an administrator can solve the problem. See the LVS HOWTO for a detailed description of the problem and the possible solutions.

Figure 28-6 shows an example of virtual server configuration. The address by which the server is known to the world is shown as VIP, which is configured on an NIC on the virtual server and as the loopback address on the two real servers. All replies to requests for the address VIP should come from only the virtual server. But when the virtual server receives a request for the services it provides, it forwards it to one of the real servers using a well-defined selection algorithm. The receiving hosts accept the packets because they have VIP locally configured. Both real servers configure ARP_IGNORE on their eth0 interface so that they will not respond to ARPOP_REQUEST for the VIP address.

Figure 28-6. Example of scenario for the use of ARP_IGNORE

ARP_IGNORE is handled in the arp_process function. Possible values are listed in Table 28-2.

Table 28-2. ARP_IGNORE values

Value

Meaning

0 (Default)

Reply for any local address.

1

Reply only if the target IP is configured on the receiving interface.

2

Like 1, but the source IP (sender's address) must belong to the same subnet as the target IP.

3

Reply only if the scope of the target IP is not the local host (e.g., that address is not used to communicate with other hosts).

4-7

Reserved.

8

Do not reply.

>8

Unknown value; accept request.
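
Table 28-2 amounts to a compact decision function. The user-space sketch below models it; the names are hypothetical (the kernel applies these rules inside arp_process), and treating the reserved levels 4-7 as "no reply" is an assumption, since the table leaves their behavior unspecified:

```c
#include <stdbool.h>

/* Scope of the target address, in the spirit of the kernel's RT_SCOPE_* values. */
enum scope { SCOPE_HOST, SCOPE_LINK, SCOPE_UNIVERSE };

/* What the receiver knows about an ingress ARPOP_REQUEST (hypothetical model). */
struct arp_request_view {
    bool target_on_rx_dev;     /* target IP configured on the receiving interface */
    bool sender_in_target_net; /* sender IP in the same subnet as the target IP */
    enum scope target_scope;   /* scope of the target address */
};

static bool should_reply(int arp_ignore, const struct arp_request_view *r)
{
    switch (arp_ignore) {
    case 0:  return true;                        /* reply for any local address */
    case 1:  return r->target_on_rx_dev;
    case 2:  return r->target_on_rx_dev && r->sender_in_target_net;
    case 3:  return r->target_scope != SCOPE_HOST;
    case 8:  return false;                       /* never reply */
    default: return arp_ignore > 8;              /* 4-7 reserved (assumed: no reply);
                                                    >8 unknown: accept the request */
    }
}
```

In the Figure 28-6 LVS setup, the real servers run with level 1 on eth0: the VIP is configured only on loopback, so `target_on_rx_dev` is false and they stay silent.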

ARP_FILTER

This option controls whether an interface should reply to an ingress ARPOP_REQUEST in scenarios where multiple NICs are connected to the same LAN and are configured on the same IP subnet. In this scenario, where each NIC receives a copy of the ARPOP_REQUEST, you want only one interface (chosen deterministically, not at random) to reply. This feature is useful mainly in networks where the IP source routing options are used.

Let's take the example in Figure 28-7. When Host A tries to resolve the 10.0.0.1 IP address, both of Host B's interfaces receive the ARPOP_REQUEST. For both requests, Host B consults the routing table and replies only to the request that was received on the interface that would be used by Host B to reach the sender's IP address (10.0.0.3). Host B's routing table shows that the 10.0.0.3 address is reachable via both eth0 and eth1. However, we will see in Part VII that when multiple routes are available toward any given IP address, a routing lookup always returns the same one[*] (i.e., the first one that matches).

When configured, ingress ARPOP_REQUEST packets are processed only if the kernel knows how to reach the sender's IP address, and if the device used to reach the sender's IP address is the same as the device where the request was received.
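
The rule above reduces to a route lookup followed by a device comparison. This self-contained user-space sketch uses a toy routing table (not the kernel's FIB code) and mirrors the Figure 28-7 scenario, where the sender 10.0.0.3 is reachable via two devices but the lookup always returns the first match:

```c
#include <stdbool.h>
#include <stdint.h>

/* A toy route: destination prefix, netmask, and egress device index. */
struct route { uint32_t dest, mask; int dev; };

/* "First match wins", mirroring the text's note that a lookup always
 * returns the same (first matching) route. Returns -1 if no route. */
static int route_lookup(const struct route *tbl, int n, uint32_t addr)
{
    for (int i = 0; i < n; i++)
        if ((addr & tbl[i].mask) == tbl[i].dest)
            return tbl[i].dev;
    return -1;
}

/* ARP_FILTER: process the request only if the route back to the sender
 * goes out the same device the request arrived on. */
static bool arp_filter_accept(const struct route *tbl, int n,
                              uint32_t sender_ip, int rx_dev)
{
    int out = route_lookup(tbl, n, sender_ip);
    return out != -1 && out == rx_dev;
}
```

With both NICs on 10.0.0.0/24, exactly one of Host B's interfaces (the one the routing table prefers) passes the check, so exactly one deterministic reply is sent.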

Note that ARP filtering has nothing to do with the filtering that can be done with Netfilter. The two are configured and enforced independently.

Unlike the previous two options, ARP_FILTER can only be enabled or disabled; there are no intermediate states. It is handled in the arp_process function.

Figure 28-7. Example of scenario for the use of ARP_FILTER

Medium ID

This is a feature that can be used to handle certain rare cases where a subnet spans different LANs, and where a host offering proxy ARP has multiple NICs on that subnet serving the different LANs. The term medium refers to a network served by a single broadcast address. If a hub or switch ties two such media together, a situation could arise where a proxy ARP host acts as a proxy inappropriately, responding to a request on behalf of a host on a different LAN that could handle the request itself.

We already saw in Chapter 26 that a proxy ARP server does not reply to an ARPOP_REQUEST that is received on the same device through which the solicited IP address can be reached. However, when multiple NICs are connected to the same LAN, this condition may not be sufficient to ensure proper behavior. Let's look at the example in Figure 28-8.[*]

Host B is configured with two NICs on the same LAN (medium). eth0 is used to reach all of the IP addresses in the 10.0.0.0/24 subnet, and eth1 is used to communicate to Host C only, thanks to the /30 netmask. Host B acts as a proxy for both LAN1 and LAN3.

Let's assume now that Host A needs to transmit something to Host C but does not have its L2 address. Host A will send a broadcast ARPOP_REQUEST, which will be received by both eth0 and eth1 on Host B as well as by Host C. Host B should not reply to the ARPOP_REQUEST because Host C will do so by itself.

Figure 28-8. Example of use of medium ID

Let's suppose Host B is a proxy ARP server with proxying enabled on all of its interfaces, and see how it behaves when it receives the ARPOP_REQUEST on both of its interfaces:

Request received on eth0

According to the routing table, the solicited address 10.0.0.1 is reachable via a different interface (eth1). Therefore, Host B processes the request. Note that Host B has two routes that match the destination address 10.0.0.1, but the one with netmask /30 is more specific and therefore wins.

Request received on eth1

In this case, the receiving interface and the one used to reach 10.0.0.1 match, so Host B does not process the request.

As you can see, there is a need for a way to tell the proxy ARP server that its two interfaces reside on the same broadcast domain, and that therefore neither of the two ARPOP_REQUESTs should be processed. This is done by assigning an ID called the medium ID to interfaces connected to the same LAN. In this case, the same medium ID should be assigned to both eth0 and eth1 on Host B. A host replies to an ingress solicitation request only when the solicited address is reachable through a device with a medium ID different from the one associated with the ingress device. Medium IDs are positive numbers; other values have special meanings as shown in Table 28-3.

Table 28-3. Value of medium ID

Value

Meaning

-1

Proxy ARP is disabled.

0 (default)

Medium ID feature is disabled.

>0

Valid medium ID.
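
Combining Table 28-3 with the rule just described gives a compact decision function. The sketch below is a simplified user-space model of the arp_fwd_proxy logic shown in Figure 28-10; the function name, parameters, and exact ordering of the checks are assumptions:

```c
#include <stdbool.h>

/* Decide whether to proxy-reply to an ingress ARPOP_REQUEST.
 *
 * rx_dev/out_dev:       ingress device and the device through which the
 *                       solicited address is reachable.
 * rx_medium/out_medium: their medium IDs, with the Table 28-3 encoding
 *                       (-1 proxy disabled, 0 feature disabled, >0 valid ID). */
static bool can_proxy(int rx_dev, int out_dev, int rx_medium, int out_medium)
{
    if (rx_medium == -1)        /* proxy ARP disabled on the ingress device */
        return false;
    if (rx_dev == out_dev)      /* Chapter 26 rule: never proxy for an address
                                   reachable via the ingress device itself */
        return false;
    if (rx_medium == 0)         /* medium ID feature disabled: the device
                                   check above is the only criterion */
        return true;
    return rx_medium != out_medium; /* reply only across different media */
}
```

With the Figure 28-8 configuration (same medium ID on eth0 and eth1), both copies of Host A's request are suppressed and Host C answers for itself; in Figure 28-9 the medium IDs stay at 0 and the plain device check suffices.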

The medium ID configuration is not necessary in the topology in Figure 28-9, where Host B's interfaces are connected to two different LANs. For details on the purpose and use of medium IDs, see http://www.ssi.bg/~ja/medium_id.txt.

Figure 28-9. Redesign of the topology in Figure 28-8 that does not need the medium ID configuration

Figure 28-10 shows the logic implemented by arp_fwd_proxy, the routine invoked by arp_process to see whether a given ARP request can be proxied based on the proxy ARP and Medium ID configuration.

ARP Protocol Initialization

The ARP protocol is initialized by arp_init in net/ipv4/arp.c.

The skeleton of a general protocol initialization routine was shown in the section "Protocol Initialization and Cleanup" in Chapter 27. In this chapter we'll examine what is ARP-specific.

The first step in the function is to register a table of virtual functions and other general parameters used by ARP; this is done by neigh_table_init. The contents of the table, arp_tbl, are described in the next section.

We saw in Chapter 13 how dev_add_pack is used to install a protocol handler. If you remember how that routine is used, from the following definition of arp_packet_type it should be clear that ARP packets will be processed by the arp_rcv function (defined in the same net/ipv4/arp.c file).

static struct packet_type arp_packet_type = {
    .type:    __constant_htons(ETH_P_ARP),
    .func:    arp_rcv,
};

arp_proc_init creates the /proc/net/arp file, which can be read to see the contents of the ARP cache (including proxy destinations).

Figure 28-10. arp_fwd_proxy function

When the kernel is compiled with support for sysctl, the directory /proc/sys/net/ipv4/neigh is created to export the default tuning parameters of the neigh_parms structure by means of neigh_sysctl_register. Note that the first input parameter to the latter is set to NULL, which, as we will see in the section "Directory creation," in Chapter 29, means that the caller (arp_init) wants to register the default directory.

register_netdevice_notifier registers a callback function with the kernel to receive notifications about changes to the configurations and status of devices. See the section "External Events" for more details.

The arp_tbl Table

This is the basic data structure that contains essential variables to which the ARP protocol refers. The role of the structure, which is of type neigh_table, was described in the section "Main Data Structures" in Chapter 27. ARP initializes its table as follows:

struct neigh_table arp_tbl = {
    .family:      AF_INET,
    .entry_size:  sizeof(struct neighbour) + 4,
    .key_len:     4,
    .hash:        arp_hash,
    .constructor: arp_constructor,
    .proxy_redo:  parp_redo,
    .id:          "arp_cache",
    .parms: {
        .tbl:            &arp_tbl,
        .base_reachable_time: 30 * HZ,
        .retrans_time:        1 * HZ,
        .gc_staletime:        60 * HZ,
        .reachable_time:      30 * HZ,
        .delay_probe_time:    5 * HZ,
        .queue_len:           3,
        .ucast_probes:        3,
        .mcast_probes:        3,
        .anycast_delay:       1 * HZ,
        .proxy_delay:         (8 * HZ) / 10,
        .proxy_qlen:          64,
        .locktime:            1 * HZ,
    },
    .gc_interval:   30 * HZ,
    .gc_thresh1:    128,
    .gc_thresh2:    512,
    .gc_thresh3:    1024,
};

As an example of the significance of these fields, the value of the base_reachable_time field (described in the section "neigh_parms Structure" in Chapter 29) indicates that ARP considers an entry NUD_REACHABLE only if the last proof of reachability arrived within the last 30 seconds. Similarly, the retrans_time field (described in the same section) indicates that if no reply is received to a solicitation, a new one will be sent after 1 second.
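
The time-valued fields in arp_tbl are expressed in jiffies, which is why they are written as multiples of HZ. A quick check of the arithmetic; HZ=100 here is only an assumed build-time value, since the real figure is a kernel configuration choice:

```c
/* The arp_tbl timing fields are seconds multiplied by HZ (ticks per second).
 * HZ is chosen at kernel build time; 100 is just an assumed example value. */
#define HZ 100

static const long base_reachable_time = 30 * HZ;       /* 30 s: reachability window   */
static const long retrans_time        = 1 * HZ;        /* 1 s: solicitation retry     */
static const long proxy_delay         = (8 * HZ) / 10; /* 0.8 s: proxy answer delay   */
```

Expressing the delays this way keeps the initializers independent of the tick rate: the same `30 * HZ` means 30 seconds whether the kernel runs at 100, 250, or 1000 ticks per second.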

In the following sections, we'll examine the hash, constructor, and proxy_redo methods. We will also see how arp_rcv processes ingress ARP packets.

Initialization of a neighbour Structure

As we saw in earlier chapters, a neighbour structure stores all of the information needed to perform the neighboring protocol's job—translating an L3 address to an L2 address—for a single L3-to-L2 address mapping. Each protocol specifies the function used to create a neighbour structure in its neigh_table->constructor virtual function. ARP's initialization function, as you can see from the definition of the arp_tbl structure in the previous section, is arp_constructor.

Basic Initialization Sequence

Figure 28-11 shows the basic steps in creating a neighbor entry.

Figure 28-11. Initialization sequence for a new neighbour structure

We saw in the section "Creating a neighbour Entry" in Chapter 27 that a neighbor can be created for different reasons and in different initial states. Because of that, the default values assigned to the fields of the neighbour structure can be overridden by the caller. For instance, neigh->nud_state is set to NUD_NONE when the neighbour structure is created as a consequence of a transmission request toward the associated neighbor. But it could also be NUD_PERMANENT or NUD_STALE if the entry was created from the command line. Broadcast and multicast IP addresses do not need any help from ARP to be translated to an L2 broadcast or multicast corresponding address, so in these cases, nud_state is set to NUD_NOARP.

Particularly important are the initializations of the following fields:

nud_state

The initial state of the neighbour structure depends on the type of L3 address and the reason the entry was created.

output

output is initialized based on the value assigned to nud_state.

ha

This field is the L2 address, which is what the ARP protocol is there to discover. Once again, ARP is not needed for the addresses described in the section "Special Cases" in Chapter 26, and the L2 address can be derived from the L3 address right away.

ops

As described in Chapter 27, this collection of virtual functions determines the operations invoked by the IP subsystem. Figure 28-12 summarizes the criteria used by ARP (more exactly, arp_constructor) to select which instance of neigh_ops to use, among the four defined in net/ipv4/arp.c.

When not explicitly set by neigh_create, the fields just described inherit the values assigned by neigh_alloc, which is called by neigh_create before it invokes the constructor virtual function.

Figure 28-12. Initialization of neigh->ops in arp_constructor

Virtual Functions in the ops Field

In the section "Common Interface Between L3 Protocols and Neighboring Protocols" in Chapter 27, we saw an overview of the functions provided by the neighboring infrastructure in net/core/neighbour.c for the output, connected_output, hh_output, and queue_xmit methods used by a neighboring protocol. In this chapter, we focus on the ones provided by the ARP protocol. The four sets of methods that can be assigned to neigh->ops (depending on the state of the neighbour) are:

static struct neigh_ops arp_generic_ops = {
    .family:           AF_INET,
    .solicit:          arp_solicit,
    .error_report:     arp_error_report,
    .output:           neigh_resolve_output,
    .connected_output: neigh_connected_output,
    .hh_output:        dev_queue_xmit,
    .queue_xmit:       dev_queue_xmit,
};

static struct neigh_ops arp_hh_ops = {
    .family:           AF_INET,
    .solicit:          arp_solicit,
    .error_report:     arp_error_report,
    .output:           neigh_resolve_output,
    .connected_output: neigh_resolve_output,
    .hh_output:        dev_queue_xmit,
    .queue_xmit:       dev_queue_xmit,
};

static struct neigh_ops arp_direct_ops = {
    .family:           AF_INET,
    .output:           dev_queue_xmit,
    .connected_output: dev_queue_xmit,
    .hh_output:        dev_queue_xmit,
    .queue_xmit:       dev_queue_xmit,
};

struct neigh_ops arp_broken_ops = {
    .family:           AF_INET,
    .solicit:          arp_solicit,
    .error_report:     arp_error_report,
    .output:           neigh_compat_output,
    .connected_output: neigh_compat_output,
    .hh_output:        dev_queue_xmit,
    .queue_xmit:       dev_queue_xmit,
};

The three fields with an ARP-specific initialization are set to the same value in all four neigh_ops instances (except that arp_direct_ops does not need some of the fields, and therefore omits their definitions).

family

AF_INET simply indicates that ARP works with IPv4.

solicit

arp_solicit is called to generate a solicitation request, either to resolve an address for the first time or to confirm one that is already in the cache. In the latter case, it is triggered by the expiration of a timer, as discussed in the section "Timers" in Chapter 27.

The transmission is done with arp_send, which is described in the section "Transmitting ARP Packets: Introduction to arp_send."

error_report

arp_error_report notifies the upper networking layers when there is an error in an ARP transaction. See the section "Generated Events."

Start of the arp_constructor Function

The first task of arp_constructor is to retrieve an in_dev structure from the device associated with the neighbor. This structure stores the IP layer configuration of the network device, which includes the ARP configuration information, too. If it does not exist, the device the neighbor is associated with does not have an IP configuration and therefore the use of ARP does not make sense; the function therefore terminates with an error.

If the device has an IP configuration, the ARP configuration information is stored in the neighbour entry via the neigh->parms pointer.

static int arp_constructor(struct neighbour *neigh)
{
    u32 addr = *(u32*)neigh->primary_key;
    struct net_device *dev = neigh->dev;
    struct in_device *in_dev;
    struct neigh_parms *parms;

    neigh->type = inet_addr_type(addr);

    rcu_read_lock();
    in_dev = rcu_dereference(__in_dev_get(dev));
    if (in_dev == NULL) {
        rcu_read_unlock();
        return -EINVAL;
    }

    parms = in_dev->arp_parms;
    __neigh_parms_put(neigh->parms);
    neigh->parms = neigh_parms_clone(parms);

    rcu_read_unlock();

    if (dev->hard_header == NULL) {
        /* Case 1: device that does not use L2 */
    } else {
        /* Case 2: device that does use L2 */
    }

The following steps depend on whether the device driver provides an L2 protocol header, dev->hard_header.

Devices That Do Not Need ARP

When dev->hard_header is not set, it means that the device driver does not provide a function to fill in the L2 header. This in turn means that the device does not have an L2 header, so the state of the neighbour entry should be set to NUD_NOARP. Moreover, neigh_ops is initialized to arp_direct_ops, which consists of a neigh_ops structure with all the functions initialized to dev_queue_xmit: because there is no need for a neighboring protocol, arp_direct_ops simply goes straight to the lower layer.

    neigh->nud_state = NUD_NOARP;
    neigh->ops = &arp_direct_ops;
    neigh->output = neigh->ops->queue_xmit;

Note that neigh->ops is not necessarily set to arp_direct_ops every time state is set to NUD_NOARP. There are cases, such as IP broadcast addresses, where the L2 layer uses a header (dev->hard_header is not NULL) but ARP is not needed.

Also note that neigh->ha is not initialized, because it is not needed.

Devices That Need ARP

As shown in Figure 28-12, once arp_constructor establishes that a device needs ARP, it has to further differentiate between devices whose drivers have been updated to the new neighboring infrastructure and those that still use the old one (see the functions noted as obsolete in net/ipv4/arp.c).

Device types are identified by the ARP header type, a list of ARPHDR_ XXX values included in include/linux/if_arp.h. arp_constructor uses these types to distinguish between old- and new-style drivers.

At the moment, only the amateur radio devices and some WAN cards are still using the old code. For these, neigh->ops is initialized to arp_broken_ops, which consists of virtual functions based on the old code.

        switch (dev->type) {
        default:
            break;
        case ARPHRD_ROSE:
#if defined(CONFIG_AX25) || defined(CONFIG_AX25_MODULE)
        case ARPHRD_AX25:
#if defined(CONFIG_NETROM) || defined(CONFIG_NETROM_MODULE)
        case ARPHRD_NETROM:
#endif
            neigh->ops = &arp_broken_ops;
            neigh->output = neigh->ops->output;
            return 0;
#endif
        }

For other devices, the kernel initializes neigh->ops based on the capabilities of the device driver. If the device driver provides a function to manage L2 header caching (dev->hard_header_cache), arp_hh_ops is used. Otherwise, the generic arp_generic_ops is selected. To know whether a given device provides this service, look at the device's associated xxx_setup function (e.g., ether_setup for Ethernet cards, as described in Chapter 8).

    if (dev->hard_header_cache)
        neigh->ops = &arp_hh_ops;
    else
        neigh->ops = &arp_generic_ops;

The initialization of neigh->output, as described earlier in the section "Basic Initialization Sequence," depends on nud_state. For example, when the neighbour structure is ready to be used (NUD_VALID), the output function can be initialized directly to connected_output. See the section "Routines used for neigh->output" in Chapter 27 for more details about neigh->output.

        if (neigh->nud_state&NUD_VALID)
            neigh->output = neigh->ops->connected_output;
        else
            neigh->output = neigh->ops->output;

The loopback device (lo) and devices configured with the IFF_NOARP flag do not need to use ARP to resolve the address. However, because the neighboring subsystem still needs an address to put into the L2 header, this function assigns the one associated with the device.

    neigh->type = inet_addr_type(addr);
    ... ... ...
    if (neigh->type == RTN_MULTICAST) {
        neigh->nud_state = NUD_NOARP;
        arp_mc_map(addr, neigh->ha, dev, 1);
    } else if (dev->flags&(IFF_NOARP|IFF_LOOPBACK)) {
        neigh->nud_state = NUD_NOARP;
        memcpy(neigh->ha, dev->dev_addr, dev->addr_len);
    } else if (neigh->type == RTN_BROADCAST || dev->flags&IFF_POINTOPOINT) {
        neigh->nud_state = NUD_NOARP;
        memcpy(neigh->ha, dev->broadcast, dev->addr_len);
    }

Transmitting and Receiving ARP Packets

The functions used to send and receive ARP packets are:

arp_send

The neighboring subsystem calls neigh_ops->solicit to transmit a solicitation request. In the case of ARP, the solicit function (arp_solicit) is a simple wrapper around arp_send. arp_send fills in the ARP header and payload and uses the dev_queue_xmit function to transmit the request.

arp_rcv

Because ARP is a protocol in its own right (unlike ND for IPv6), it registers a handler in arp_init. The next section describes arp_rcv in detail, along with how the two main ARP packet types are processed.

As shown in Figure 28-13, both transmission and reception of ARP packets can be controlled by Netfilter.

The dotted lines between arp_rcv and arp_send indicate that in some cases, the reception of an ARP packet triggers the transmission of at least one other ARP packet. This occurs when:

  • Bridging is configured. A bridge receiving an ARP packet may just forward it to other bridge interfaces without processing it.

  • The ingress packet is an ARPOP_REQUEST and the neighboring subsystem decides it can reply according to its configuration. The subsystem generates an ARPOP_REPLY.

As Figure 28-13 shows, arp_send is also triggered by external events and a few kernel features such as bonding; details are provided in later sections.

Transmitting ARP Packets: Introduction to arp_send

arp_send is the routine provided by ARP to transmit both solicitation requests and replies, as shown in Figure 28-14. Chapter 27 explained on a protocol-independent level how the neighbor infrastructure takes care of solicitation transmissions and retransmissions. Here we'll see how arp_send accomplishes its job.

Figure 28-13. (a) arp_rcv; (b) arp_send

As shown in Figure 28-13(b), arp_send is split into two parts: arp_create initializes the ARP packet, and arp_xmit hooks into Netfilter and then invokes dev_queue_xmit.

arp_send is split into these two parts so that drivers that need to manipulate a packet—for instance, by inserting extra headers—can call arp_create and arp_xmit separately. The driver can thus perform some customization in between. See, for example, how the bonding code manages to add the 802.1Q tag if needed (rlb_update_client in drivers/net/bonding/bond_alb.c).

Figure 28-14. Examples of contexts where arp_send is used

Solicitations

We saw in the section "Creating a neighbour Entry" in Chapter 27 the times when the kernel may need to generate a solicitation request. In this section, we analyze arp_solicit, the routine used by ARP to accomplish this task.

The caller of arp_solicit is responsible for counting the number of probes (solicitation transmission attempts) made and ensuring that the maximum has not yet been reached. arp_solicit, therefore, doesn't have to worry about this task.

Here is its prototype and the meaning of the two input parameters:

static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
neigh

Neighbor whose L3 address needs to be resolved.

skb

Buffer holding the data packet whose transmission attempt triggered the generation of the solicitation request.

To understand the implementation of arp_solicit, it is important to understand the relationships and differences between the following two groups of parameters:

  • The source IP address in the IP header of the skb buffer, and the source IP address selected by arp_solicit to put in the ARP header (see Figure 28-1 for the ARP header format).

    When the traffic is generated locally, the source IP address in the IP header is local to the system. When the packet is being forwarded, the source IP address is that of the original sender.

  • The destination IP address in the IP header of the skb buffer, and the destination IP address that arp_solicit is asked to resolve (neigh->primary_key).

    The address that ARP is asked to resolve is the address of the next hop used to route skb. This matches the destination IP address in the IP header only when the next hop is also the final destination.

The main tasks of arp_solicit are:

  • Select the source IP address to put in the ARP header. This can be influenced by the ARP_ANNOUNCE configuration mentioned in the section "/proc Options." Figure 28-15 shows the internals of arp_solicit and in particular how the source IP address is selected.

  • Update the number of solicitation requests generated.

  • Transmit the solicitation using arp_send.

The next section will go into detail on the selection of the source IP address. Let's briefly see how the other two tasks are accomplished.

arp_solicit differentiates between requests that should be generated by the kernel and requests that should be generated from user space. The latter can happen when an arpd ARP daemon is running; this requires that the kernel be compiled with the ARPD option, and is discussed in the section "ARPD." The two cases are handled as follows:

  • For kernel-generated requests, the solicitation is transmitted with arp_send.

  • For user-space requests, arp_solicit makes a call to neigh_app_ns to notify the interested user-space application about the need to generate a solicitation request. If the kernel has not been compiled with support for ARPD, arp_solicit simply returns without making the solicitation request.

ARP_ANNOUNCE and selection of source IP address

Most hosts have just one IP address, so this can be copied into the ARP header. When a host offers multiple IP addresses, the choice can be influenced by ARP_ANNOUNCE. arp_solicit simply applies the logic described in Table 28-1 and depicted in Figure 28-15. In order to accomplish its job, it makes use of three routines made available by the routing and configuration subsystems:

inet_addr_type

Given an IP address in input, this function returns the address type. In the context of this chapter, we are interested in the value RTN_LOCAL, which indicates an address that belongs to the local host.

inet_addr_onlink

Given a device and two IP addresses, this function checks whether the two addresses belong to the same subnet.

inet_select_addr

Given a device, an IP address (usually not local to the system), and a scope, this function searches the device configuration for an IP address that falls within the same subnet as the ingress address and with a scope that is the same or smaller. The scope typically covers a site, a link, or a host. An input address of 0 makes any primary address configured on the input device eligible for selection. You can find a more detailed description in Chapter 30.

Note that when ARP_ANNOUNCE is configured at level 0 or 1 and the source IP address in the IP header cannot be used, arp_solicit falls back to level 2. inet_select_addr is invoked with a scope of RT_SCOPE_LINK. Given a device dev and a target IP address IP, inet_select_addr browses the IP addresses configured on dev and selects the first one that matches the subnet of the target IP address IP and has a scope greater than or equal to RT_SCOPE_LINK. Scopes are described in the section "Scope" in Chapter 30.

Figure 28-15. Selection of source IP in arp_solicit

Processing Ingress ARP Packets

As explained in the section "ARP Protocol Initialization," ARP registers the arp_rcv routine as its protocol handler. Let's see how this handler processes incoming ARP packets.

The ARP packet can be accessed from the skb buffer that is the function's input argument; in particular, the ARP header is at skb->nh.arph. The function's first task is to make sure the ARP packet is not fragmented; that is, that it can be accessed linearly in memory. This task is necessary because sometimes the skb buffer is fragmented in memory.[*] If it is, arp_rcv calls the generic routine pskb_may_pull to make sure there is enough room in the main buffer for the ARP header and payload.

int arp_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt)
{
    struct arphdr *arp;

    /* ARP header, plus 2 device addresses, plus 2 IP addresses.  */
    if (!pskb_may_pull(skb, (sizeof(struct arphdr) +
                 (2 * dev->addr_len) +
                 (2 * sizeof(u32)))))
        goto freeskb;

An input ARP packet is dropped by arp_rcv if one of the following conditions is met:

  • It was received on a device that does not use ARP (i.e., one tagged with the IFF_NOARP flag).

    The loopback interface is a special case within this category. Packets sent to and from the loopback interface are classified with the PACKET_LOOPBACK type. Since such an interface is virtual and does not have a hardware address, there is no need to use ARP.

  • It was not addressed to the receiving interface (i.e., the destination address was not the receiving interface's address or the broadcast address).

In case the buffer was shared (that is, someone else holds a reference to it), arp_rcv clones the buffer with skb_share_check. Cloning is necessary to make sure that no one will change the content of skb (in particular, its header pointers) while processing the ARP packet. See the section "Cloning and copying buffers" in Chapter 2 for more details.

Refer to the section "ARP Packet Format" for the meaning of SIP and TIP. Once an ingress ARP packet is ready to be processed, assuming Netfilter does not intercept it, arp_process takes care of it, as shown in Figure 28-13.

Figure 28-16 shows the structure of the arp_process function. It starts with some sanity checks common to all the ARP packet types it understands, and then continues with operations specific to particular packet types. The final part of the function is another common piece of code that updates the cache with the new information, unless the entry to update is locked (see the section "Final Common Processing"). Requests for multicast IP addresses are dropped because they are illegal: we saw in the section "Special Cases" in Chapter 26 that multicast IP addresses do not need the use of ARP to be translated to link layer addresses.

Initial Common Processing

arp_process processes both ARPOP_REQUEST and ARPOP_REPLY packet types. Any other ARP packet type is dropped. Packets with a multicast or broadcast destination address, which can be detected with the LOOPBACK and MULTICAST macros,[*] are also dropped because ARP is not needed for them, as described in the earlier section "Destination Address Types for ARP Packets," and the section "Special Cases" in Chapter 26.

Figure 28-16. arp_process function

Some device types are supported by the kernel only when it has been explicitly compiled with support for them. They are not included by default because they are not used very often, so the kernel developers decided to reduce the kernel size by making their support optional. The switch statement shown here simply goes one by one through these device types (using a #ifdef to make sure each one has been compiled into the kernel) and checks whether the protocol specified on the ARP packet is correct for that device type. This part of the code is long and repetitive.

    switch (dev_type) {
    default:
        if (arp->ar_pro != htons(ETH_P_IP) ||
            htons(dev_type) != arp->ar_hrd)
            goto out;
        break;
#ifdef CONFIG_NET_ETHERNET
    case ARPHRD_ETHER:
    ... ... ...
        if ((arp->ar_hrd != htons(ARPHRD_ETHER) &&
             arp->ar_hrd != htons(ARPHRD_IEEE802)) ||
            arp->ar_pro != htons(ETH_P_IP))
            goto out;
        break;
#endif
#ifdef CONFIG_TR
    case ARPHRD_IEEE802_TR:
            ... ... ...
#endif
            ... ... ...
    }

The last task in this section of arp_process is to initialize a few local variables from fields of the ARP header to make later code cleaner. This part of the function is not shown here, but is fairly easy to understand by consulting Figure 28-1. arp_ptr points to the end of the hardware header.

Processing ARPOP_REQUEST Packets

Figure 28-17 is a high-level description of how ARPOP_REQUEST packets are processed by arp_process. arp_process processes both requests for local IP addresses and requests for nonlocal IP addresses. The latter case—that is, the left side of the figure—is described in the section "Proxy ARP." Table 28-4 explains the meanings of SIP and TIP.

Table 28-4. Parameters extracted from the ARP packet

    ARP packet field          Local variable name
    Sender Ethernet address   sha
    Sender IP address         sip
    Target Ethernet address   tha
    Target IP address         tip

An ARPOP_REQUEST is processed only if all of the following are true:

  • The kernel knows how to reach the address requested by the sender (that is, there is a valid route to the address in the routing table).

        if (arp->ar_op == htons(ARPOP_REQUEST) &&
            ip_route_input(skb, tip, sip, 0, dev) == 0) {
            /* Process packet */
        }

    Figure 28-17. ARPOP_REQUEST handling by arp_process

    This is a simple way to filter out requests for IP addresses about which the local system has no knowledge. When the local system is a host, it replies only to requests for IP addresses configured on the local interfaces. When the local system is a proxy ARP server, it also replies to requests for IP addresses that fall within any of the subnets configured on the local interfaces (i.e., IP addresses belonging to neighbor hosts).

    We will see in Part VII that the routing subsystem adds an entry to the routing table for each IP address configured locally, and one for the subnet associated to each of those IP addresses. In both cases, therefore, a routing lookup is sufficient to filter out the requests for those IP addresses the local host should not reply to.

  • Either the requested address is on the system, or it is a remote address handled by this host as a proxy ARP host. In this section, we address the local case, identified by the RTN_LOCAL flag. The section "Proxy ARP" describes the remote case.

  • There is no configuration explicitly forbidding the transmission of an ARPOP_REPLY (see the earlier sections "ARP_IGNORE" and "ARP_FILTER").

If everything is OK, arp_process calls arp_send with the right input parameters. arp_send was described in the section "Transmitting ARP Packets: Introduction to arp_send."

        rt = (struct rtable*)skb->dst;
        addr_type = rt->rt_type;

        if (addr_type == RTN_LOCAL) {
            n = neigh_event_ns(&arp_tbl, sha, &sip, dev);
            if (n) {
                int dont_send = 0;

                if (!dont_send)
                    dont_send |= arp_ignore(in_dev,dev,sip,tip);
                if (!dont_send && IN_DEV_ARPFILTER(in_dev))
                    dont_send |= arp_filter(sip,tip,dev);
                if (!dont_send)
                    arp_send(ARPOP_REPLY,ETH_P_ARP,sip,dev,
                                           tip,sha,dev->dev_addr,sha);
                neigh_release(n);
            }
            goto out;
        } else {

                   /* Handle Proxy ARP if all the required conditions */
                   /* are met. See the section "Proxy ARP"           */

Passive learning and ARP optimization

The section "Creating a neighbour Entry" in Chapter 27 mentioned that at the end of an ARP transaction, both the requester and the replier learn something. The sender achieves its essential goal of learning the target's address from the ARPOP_REPLY; this is called active learning. But the target host that receives the ARPOP_REQUEST learns the sender's address from the request itself; this is called passive learning. It is a valuable optimization of the neighboring protocol.

Passive learning is taken care of by neigh_event_ns. The latter checks if it already has an entry associated to the requester; it then updates an existing entry or creates a new entry if one doesn't already exist.

Whether updating an existing entry or creating a new one, the function sets the state of the neighbor to NUD_STALE. ARP does not take the optimistic step of calling it NUD_REACHABLE because that state is reserved for hosts that have provided proof of reachability, a stricter requirement described in Chapter 27.

neigh_event_ns returns NULL when it fails to create an entry (usually because of a lack of memory—that is, no space is available in the cache). In this case, a reply is not sent to the requester. This policy is a little conservative; a more aggressive approach would be to reply anyway on the basis that even though we are temporarily unable to create an entry on our system for the neighbor, we should not deprive it of the ability to transmit data to us.

neigh_event_ns calls one of the lookup functions described in the section "Caching" in Chapter 27. Because these always increment the entry's reference counter when the search succeeds, neigh_event_ns needs to decrement the reference count correspondingly.

Requests with zero addresses

When the source IP address in an ARP request is set to 0 (0.0.0.0 in standard quad notation), it could be a corrupted packet, because 0.0.0.0 is not a valid IP address. However, it could also be a special packet used by DHCP to detect duplicated addresses. See the earlier section "Duplicate Address Detection" for the conditions under which these packets are sent, and RFC 2131, section 2.2, for the use of a 0 address.

A DHCP server or client can optionally send an ARPOP_REQUEST for a DHCP-assigned IP address to double-check whether, by mistake, the same address is already in use by another host. That special ARPOP_REQUEST is sent with a source IP address of 0.0.0.0 so that it will not create any trouble for the other hosts on the subnet.

The following code in arp_process runs when the source IP address (sip) is 0, and lets the local host claim an address when the packet's sender is making this type of request:

    if (sip == 0) {
        if (arp->ar_op == htons(ARPOP_REQUEST) &&
            inet_addr_type(tip) == RTN_LOCAL &&
               !arp_ignore(in_dev, dev, sip, tip))
            arp_send(ARPOP_REPLY,ETH_P_ARP,tip,dev,tip,sha,
                            dev->dev_addr,dev->dev_addr);
        goto out;
    }

Processing ARPOP_REPLY Packets

Incoming ARPOP_REPLY packets are processed if one of the following conditions is met:

  • There is a pending ARPOP_REQUEST that matches the received ARPOP_REPLY. In other words, the ARPOP_REPLY is a reply to an ARPOP_REQUEST the kernel generated earlier. This is the most common case.

  • 没有悬而未决的ARPOP_REQUEST,但内核已经编译支持(参见“编译时选项UNSOLICITED_ARP”部分)。在这种情况下,通过使用非 NULL 最后一个参数调用来创建一个新的邻居条目。_ _neigh_lookup

    #ifdef CONFIG_IP_ACCEPT_UNSOLICTED_ARP
        如果(n == NULL &&
            arp->ar_op == htons(ARPOP_REPLY) &&
            inet_addr_type(sip) == RTN_UNICAST)
            n = _ _neigh_lookup(&arp_tbl, &sip, dev, -1);
    #万一
  • There is no pending ARPOP_REQUEST, but the kernel has been compiled with support for UNSOLICITED_ARP (see the section "Compile-Time Options"). In this case, a new neighbor entry is created by calling __neigh_lookup with a non-NULL last parameter.

    #ifdef CONFIG_IP_ACCEPT_UNSOLICITED_ARP
        if (n == NULL &&
            arp->ar_op == htons(ARPOP_REPLY) &&
            inet_addr_type(sip) == RTN_UNICAST)
            n = __neigh_lookup(&arp_tbl, &sip, dev, -1);
    #endif

The right and left sides of Figure 28-18, respectively, show how these two cases are handled.

Figure 28-18. ARPOP_REPLY handling by arp_process

Regardless of why the packet is accepted, the existing neighbour entry is updated by the common code described in the next section (and is shown in the dotted box in the figure) to reflect the information in the ARPOP_REPLY packet.

Final Common Processing

The last part of arp_process is executed for all ARPOP_REPLY packets, and for ARPOP_REQUEST packets that have not been processed because they did not meet the conditions listed in the section "Processing ARPOP_REQUEST Packets."

请记住，当主机回复 ARPOP_REQUEST 时，它会对调 ARP 标头的源字段和目标字段，并填充空白字段。

Remember that when a host replies to an ARPOP_REQUEST, it inverts the source and destination fields of the ARP header, as well as fills in the empty spaces.

在阅读这段代码时需要理解的另一个概念是 locktime。它与内核经常使用的信号量式锁定无关；相反，它是一种简单的超时机制，用于处理主机可能针对同一个 ARPOP_REQUEST 收到多个 ARPOP_REPLY 的情况。如果存在某种配置错误，或者同一 LAN 上有多个代理 ARP 服务器，就可能发生这种情况；arp_process 函数的应对方式是只使用第一个回复并拒绝后续的回复。

Another concept to understand, in reading this code, is the locktime. This is unrelated to the semaphore type of locking used frequently by the kernel. Rather, it's a simple kind of timeout that takes care of the chance that a host could receive more than one ARPOP_REPLY for the same ARPOP_REQUEST. This could happen if there is some kind of misconfiguration or if there are multiple proxy ARP servers on the same LAN; the arp_process function reacts by using only the first reply and rejecting subsequent replies.

其机制如下：相邻子系统在 neigh_table 结构体中引入了 locktime 参数；该参数也可以通过 /proc 调整。以下代码将 override 设置为反映 locktime 的一个未来时间。（locktime 以 jiffies 表示，因此值 HZ 表示 1 秒。）只有当前一个 locktime 期间内没有为同一条目调用过 neigh_update 时，才会调用该函数来更新条目。

The mechanism is as follows: the neighboring subsystem introduces the locktime parameter in the neigh_table structure; the parameter can also be tuned by /proc. The following code sets override to a time in the future that reflects locktime. (locktime is expressed in jiffies, so a value of HZ means 1 second.) The neigh_update function is called to update an entry only if it wasn't called for that same entry during the preceding locktime.

因此,最终的代码是:

Thus, the final code is:

    n = _ _neigh_lookup(&arp_tbl, &sip, dev, 0);
    ...
    if (n) {
        int state = NUD_REACHABLE;
        int override;

        override = time_after(jiffies, n->updated + n->parms->locktime);

        if (arp->ar_op != htons(ARPOP_REPLY) ||
            skb->pkt_type != PACKET_HOST)
            state = NUD_STALE;

        neigh_update(n, sha, state, override ? NEIGH_UPDATE_F_OVERRIDE : 0);
        neigh_release(n);
    }

代码必须为正在更新的 neighbour 条目选择正确的状态。正如第 26 章“可达性”一节所解释的，单播回复和广播回复具有不同级别的权威性。单播回复（PACKET_HOST）将邻居状态设置为 NUD_REACHABLE，广播回复将其设置为 NUD_STALE。由 ARPOP_REQUEST 数据包引起的更新总是将状态设置为 NUD_STALE。

The code has to select the right state to assign to the neighbour entry being updated. As explained in the section "Reachability" in Chapter 26, unicast and broadcast replies have different levels of authority. A unicast reply (PACKET_HOST) sets the neighbor state to NUD_REACHABLE, and a broadcast reply sets it to NUD_STALE. Updates caused by ARPOP_REQUEST packets always set the state to NUD_STALE.
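As a cross-check of the logic above, here is a small Python sketch (illustrative only, not kernel code) of the two decisions made by the final common processing: whether the update may override the cached entry (the locktime test) and which NUD state to assign. The constants mirror the kernel's ARPOP_REPLY and packet-type values.

```python
ARPOP_REPLY = 2        # ARP opcode for a reply
PACKET_HOST = 0        # frame was unicast to this host
PACKET_BROADCAST = 1   # frame was broadcast

def arp_update_decision(jiffies, entry_updated, locktime, ar_op, pkt_type):
    """Return (override, state) as computed at the end of arp_process."""
    # time_after(jiffies, n->updated + n->parms->locktime): the entry may be
    # overridden only if it was last updated more than locktime jiffies ago.
    override = jiffies > entry_updated + locktime
    # Only a unicast ARPOP_REPLY is authoritative enough for NUD_REACHABLE;
    # requests and broadcast replies yield NUD_STALE.
    if ar_op != ARPOP_REPLY or pkt_type != PACKET_HOST:
        state = "NUD_STALE"
    else:
        state = "NUD_REACHABLE"
    return override, state

# A second reply arriving within locktime cannot override the first one:
print(arp_update_decision(1050, 1000, 100, ARPOP_REPLY, PACKET_HOST))
# A broadcast reply after locktime may override, but only to NUD_STALE:
print(arp_update_decision(1200, 1000, 100, ARPOP_REPLY, PACKET_BROADCAST))
```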

代理ARP

Proxy ARP

在“处理入口 ARP 数据包”一节中，我们看到了 arp_process 如何处理对本地地址的请求。现在我们将了解同一函数如何以及何时处理对远程地址的请求。

In the section "Processing Ingress ARP Packets," we saw how requests for local addresses were handled by arp_process. Now we will see how and when requests for remote addresses are handled by the same function.

我们在第 26 章的“代理所需的条件”和第 27 章的“每设备代理和每目的地代理”部分中看到，内核支持两种类型的代理：基于设备的和基于目的地的（或全局的）。默认情况下，主机上禁用每设备代理 ARP；它可以通过 /proc 接口全局启用或针对每个设备启用。内核可以通过 include/linux/inetdevice.h 中定义的 IN_DEV_PROXY_ARP 宏检查给定设备上是否启用了代理 ARP。按目的地代理可以使用 arp 或 ip neigh 命令配置（请参阅第 29 章中的“邻居系统管理”部分）。

We saw in the sections "Conditions Required by the Proxy" in Chapter 26 and "Per-Device Proxying and Per-Destination Proxying" in Chapter 27, that the kernel supports two types of proxying : device-based and destination-based (or global). Per-device proxy ARP is disabled on a host by default. It can be enabled either globally or on a per-device basis via the /proc interface. The kernel can check whether proxying ARP is enabled on a given device through the IN_DEV_PROXY_ARP macro defined in include/linux/inetdevice.h. Per-destination proxying can be configured with either the arp or the ip neigh command (see the section "System Administration of Neighbors" in Chapter 29).

ARP 增加了一项进行代理的条件:目标网络地址转换。我们将在“目标 NAT (DNAT) ”部分中了解为什么内核需要在配置 DNAT 时代理请求。

ARP adds one more condition under which it does proxying: Destination Network Address Translation. We will see in the section "Destination NAT (DNAT)" why the kernel needs to proxy requests when DNAT is configured.

为了使 ARPOP_REQUEST 有资格由代理服务器处理，必须满足以下条件：

For an ARPOP_REQUEST to be eligible for handling by a proxy server, the following conditions must be true:

  • 转发在接收设备上启用,或在代理主机上全局启用。

  • Forwarding is enabled on the receiving device, or globally on the proxying host.

  • 目标 IP 地址是单播地址（因为其他地址类型不需要通过 ARP 解析，正如我们在第 26 章“特殊情况”一节中看到的）。用代码术语来说，addr_type==RTN_UNICAST。

  • The target IP address is unicast (because other address types don't need ARP to be resolved, as we saw in the section "Special Cases" in Chapter 26). In code terms, addr_type==RTN_UNICAST.

  • 接收此请求的设备不是可以到达目标 IP 地址的设备(因为如果是,则不需要代理:目标主机可以自行回复)。用代码术语来说,rt->u.dst.dev!=dev.

  • The device receiving this request is not the one through which the target IP address can be reached (because if it was, no proxying would be needed: the target host can reply by itself). In code terms, rt->u.dst.dev!=dev.

arp_process 函数中的以下代码显示了它如何检查刚刚列出的条件：

The following code from the arp_process function shows how it checks for the conditions just listed:

        if (addr_type == RTN_LOCAL) {
                   ... ... ...
        } else if (IN_DEV_FORWARD(in_dev)) {
            if ((rt->rt_flags&RTCF_DNAT) ||
                (addr_type == RTN_UNICAST  && rt->u.dst.dev != dev &&
                 (arp_fwd_proxy(in_dev, rt) ||
                         pneigh_lookup(&arp_tbl, &tip, dev, 0)))) {

如果满足基本条件，代理主机将检查其基于设备和基于目的地的代理配置。其逻辑如第 26 章的图 26-8 所示。以下条件决定代理主机是否响应该地址。

If the basic conditions are met, the proxy host checks its configuration of device-based and destination-based proxying. The logic is shown in Figure 26-8 in Chapter 26. The following conditions determine whether the proxy host responds to the address.

  • 代理 ARP 在设备上或全局启用。

  • Proxy ARP is enabled either on the device or globally.

  • 入口和出口接口不在同一介质上,如“介质 ID ”部分中所述。

  • The ingress and egress interfaces are not on the same medium, as explained in the section "Medium ID."

  • 目标地址位于被代理的地址数据库中。该数据库按目标地址组织,并通过该pneigh_lookup函数进行查询。

  • The target address is in the database of addresses being proxied. This database is organized by destination address and is queried through the pneigh_lookup function.
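The two groups of conditions (basic eligibility plus proxy configuration) can be summarized in one hedged Python sketch. The parameter names are descriptive stand-ins for the kernel expressions noted in the comments, not real kernel identifiers.

```python
def proxy_arp_eligible(forwarding, rtcf_dnat, addr_type, egress_dev,
                       ingress_dev, dev_proxy_enabled, pneigh_has_target):
    """Decide whether arp_process may answer a request on behalf of a proxy."""
    if not forwarding:                        # IN_DEV_FORWARD(in_dev)
        return False
    if rtcf_dnat:                             # rt->rt_flags & RTCF_DNAT
        return True
    return (addr_type == "RTN_UNICAST"        # target must be unicast
            and egress_dev != ingress_dev     # rt->u.dst.dev != dev
            and (dev_proxy_enabled            # arp_fwd_proxy(...)
                 or pneigh_has_target))       # pneigh_lookup(...)

# Eligible: forwarding on, unicast target reachable via a different device,
# and per-device proxying enabled on the ingress device.
print(proxy_arp_eligible(True, False, "RTN_UNICAST", "eth1", "eth0", True, False))
# Not eligible: the target is reachable through the receiving device itself,
# so the target host can answer on its own.
print(proxy_arp_eligible(True, False, "RTN_UNICAST", "eth0", "eth0", True, False))
```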

假设 arp_process 获得了处理 ARPOP_REQUEST 的绿灯。

Let's suppose that arp_process has the green light to process the ARPOP_REQUEST.

首先，neigh_event_ns 用于为发送者的 IP 地址创建（或仅更新）一个 neighbour 条目，就像 ARP 处理对本地地址的请求时所做的那样，如“被动学习和 ARP 优化”一节所述。

First, neigh_event_ns is used to create (or just update) a neighbour entry for the sender's IP address, just as it does when ARP is processing requests for local addresses as described in the section "Passive learning and ARP optimization."

可以延迟代理 ARP 的处理，以防止网络上出现流量突发，如第 27 章“请求的延迟处理”一节所述。因此，如果数据包直接来自另一台主机并且配置了延迟处理，它会被放入代理队列。如果数据包来自队列（即之前已入队，现在到了处理它的时间），或者未配置延迟处理，则立即处理该数据包。

Processing of proxy ARP can be delayed to prevent bursts of traffic on the network, as described in the section "Delayed Processing of Solicitation Requests" in Chapter 27. Thus, if a packet comes directly from another host and delayed processing is configured, it is enqueued on the proxy queue. If the packet comes from the queue (that is, it was previously enqueued and the time has come to handle it) or if delayed processing is not configured, the packet is processed now.

                n = neigh_event_ns(&arp_tbl, sha, &sip, dev);
                if (n)
                    neigh_release(n);

                if (skb->stamp.tv_sec == LOCALLY_ENQUEUED ||
                    skb->pkt_type == PACKET_HOST ||
                    in_dev->arp_parms->proxy_delay == 0) {
                    arp_send(ARPOP_REPLY,ETH_P_ARP,
                                           sip,dev,tip,sha,dev->dev_addr,sha);
                } else {
                    pneigh_enqueue(&arp_tbl, in_dev->arp_parms, skb);
                    in_dev_put(in_dev);
                    return 0;
                }
                goto out;
            }
        }

目标 NAT (DNAT)

Destination NAT (DNAT)

目的 NAT（在 IPROUTE2 术语中也称为路由 NAT）允许主机定义虚拟（NAT）地址：发往这些地址的入口数据包会被主机检测到并转发到另一个地址。DNAT 主要由路由器使用，与 Netfilter 实现的 Destination NAT 无关。[ * ]

Destination NAT, also called Route NAT in IPROUTE2 terminology, allows a host to define dummy (NAT) addresses: ingress packets addressed to them are detected by the host and forwarded to another address. DNAT is used mainly by routers, and bears no relation to the Destination NAT implemented by Netfilter.[*]

需要注意的是,虽然 Linux 中的 ARP 代码可以处理 DNAT,但路由代码似乎已经放弃了对其的支持。因此,该功能目前在内核 2.6 中已被破坏。

It should be noted that although the ARP code in Linux handles DNAT, the routing code seems to have dropped support for it. Therefore, this feature is currently broken in kernel 2.6.

图 28-19说明了 DNAT。路由器 RT 已配置虚拟 NAT 地址 10.0.0.5。每当 RT 收到寻址到 10.0.0.5 的流量时,它会将目标地址更改为 10.0.1.10 并将流量转发到具有该地址的主机。当然,该配置可确保反向流量也得到处理。

Figure 28-19 illustrates DNAT. The router RT has been configured with the dummy NAT address 10.0.0.5. Whenever RT receives traffic addressed to 10.0.0.5, it changes the destination address to 10.0.1.10 and forwards the traffic to the host with that address. Of course, the configuration ensures that reverse traffic is also taken care of.


图 28-19。DNAT示例

Figure 28-19. DNAT example

所有这一切都是使用代理 ARP 完成的。10.0.0.0/24子网中没有主机配置10.0.0.5地址。然而,该地址被公开为给定主机(例如,Web 服务器)的地址。每当子网 10.0.0.0/24 上的主机想要与 10.0.0.5 通信时,它就会像其他地址一样发送对该地址的 ARP 请求。由于 ARP 请求发送到以太网广播地址,RT 接收它并通过回复 ARP 请求来代理它,提供其eth0接口的 L2 地址。从那时起,RT 代理请求者和 10.0.1.10 之间的流量。

All of this is done using proxy ARP. In the 10.0.0.0/24 subnet, no host is configured with the 10.0.0.5 address. However, that address is publicized as the address of a given host (for instance, a web server). Whenever a host on the subnet 10.0.0.0/24 wants to talk to 10.0.0.5, it sends an ARP request for that address like any other. Because the ARP request is sent to the Ethernet broadcast address, RT receives it and proxies it by replying to the ARP request, providing the L2 address of its eth0 interface. From that moment on, RT proxies traffic between the requester and 10.0.1.10.

当主机配置虚拟 NAT 地址时，会创建一个特殊的路由表条目并用 RTCF_DNAT 标志进行标记，以便 ARP 可以检查并代理该地址。

When a host configures a dummy NAT address, a special routing table entry is created and tagged with the RTCF_DNAT flag so that ARP can check and proxy the address.

代理 ARP 服务器作为路由器

Proxy ARP Server as Router

在这一点上,代理和路由器可能看起来很相似,并且在某种程度上确实如此。事实上,路由器通常是在 IPv4 下处理代理 ARP 的主机,并且(如 RFC 2461 中所述)仅允许路由器在 IPv6 下执行此操作。但代理和路由在以下方面有所不同:代理 ARP 服务器通常对其所服务的主机是透明的,而路由器则不然。每台主机都需要明确配置才能使用路由器。在最常见的场景中,代理服务器充当位于不同局域网但配置相同IP子网的主机之间的透明路由器,如图 28-20所示。

At this point, a proxy and a router may seem similar, and to some extent they are. In fact, routers are usually the hosts that handle proxy ARP under IPv4, and (as described in RFC 2461) only routers are allowed to do it under IPv6. But proxying and routing differ in the following aspect: while a proxy ARP server is usually transparent to the hosts being served by it, a router is not. Each host needs to be explicitly configured to use the router. In the most common scenario, a proxy server acts as a transparent router between hosts located in different LANs but configured with the same IP subnet, as shown in Figure 28-20.

图 28-20(a)显示了一个简单的拓扑,其中两个具有 /25 网络掩码的子网通过路由器进行通信。图28-20(b)显示了相同的拓扑如何允许两个子网上的主机通过代理而不是路由器进行通信,只需将其网络掩码从/25更改为/24;此更改加入了 10.0.1.0/25 和 10.0.1.128/25 子网。图28-20(c)是采用图28-20(b)的配置,两个子网的主机感知到的拓扑。

Figure 28-20(a) shows a simple topology where two subnets with a /25 netmask communicate via a router. Figure 28-20(b) shows how the same topology allows hosts on the two subnets to communicate via a proxy rather than a router by simply changing their netmasks from /25 to /24; this change joins the 10.0.1.0/25 and 10.0.1.128/25 subnets. Figure 28-20(c) is the topology that the hosts of the two subnets perceive with the configuration of Figure 28-20(b).

请注意,图 28-20中的示例并不意味着建议代理 ARP 服务器和路由器之间的任何配置或首选项。这两种设备用于完成不同的任务:路由器将子网隔离为 LAN,而代理 ARP 服务器将不同的 LAN 合并为单个子网。提供该示例只是为了说明主机的配置如何根据两个 LAN 中的主机是通过路由器还是代理 ARP 服务器进行通信而发生变化。当然,您可以将所有主机放置在一个 LAN 中,而不需要任何路由器和代理 ARP 服务器,但由于我们正在讨论代理 ARP,因此我需要提供其使用示例。

Note that the examples in Figure 28-20 are not meant to suggest any configuration or preference between a proxy ARP server and a router. The two devices are used to accomplish different tasks: a router segregates subnets into LANs, whereas a proxy ARP server merges different LANs into a single subnet. The example is provided only to show how the configuration of the hosts changes based on whether the hosts in the two LANs communicate via a router or a proxy ARP server. Of course, you may be able to place all hosts in a single LAN with no need for any routers and proxy ARP servers, but since we are discussing proxy ARP, I need to provide an example of its use.

有一个特殊情况值得一提：代理 ARP 服务器可以配置为充当透明默认网关。换句话说，管理员可以不在 LAN 中的每台主机上配置默认路由，而是让主机使用代理 ARP 来到达默认网关。为此，管理员为主机配置网络掩码为 /0 的地址，这与定义默认网关路由时使用的网络掩码相同。这样，代理 ARP 服务器就会处理所有发往未知地址的流量，从而有效地成为默认网关。代理 ARP 服务器甚至可以更改其地址而不影响这些主机，只要它更新所有主机缓存中的旧 neighbour 条目即可（请参阅“免费 ARP”一节了解如何做到这一点）。但是，这种看似聪明的方案效率并不高，原因我将在下面解释。

There is a special case worth mentioning: a proxy ARP server can be configured to act as a transparent default gateway. In other words, instead of configuring a default route on each host on a LAN, the administrator can let hosts use proxy ARP to reach the default route. To do this, the administrator configures the hosts with addresses that have a /0 netmask, the same netmask used when defining the default gateway route. In this way, the proxy ARP server handles all traffic to unknown addresses, effectively becoming the default gateway. The proxy ARP server can even change its address without any impact on the hosts, as long as it updates all the old neighbour entries in the host's caches (see the section "Gratuitous ARP" for how this can be done). However, this clever-looking scenario is not very efficient, for reasons I'll explain next.

包含代理 ARP 服务器的网络拓扑(如图28-20(b)所示)会注册大量请求请求和回复。当被代理的主机数量较多时,请求使用的带宽百分比可能会变得相当大。

A network topology that includes a proxy ARP server, like the one in Figure 28-20(b), registers a high volume of solicitation requests and replies. When the number of hosts being proxied is high, the percentage of bandwidth used by solicitations may become considerable.

给定如图 28-20(a) 所示的网络，最坏的情况是 /25 子网包含满额的 126 台主机（7 位所能容纳的数量，减去网络地址和广播地址），并且每台主机都需要解析所有其他主机的地址。这将导致 126 * (126-1) 个不同的请求。然而，这样的最坏情况远非平均情况，因为主机通常只需要访问少数本地计算机，例如服务器。主机的大部分流量都会流向路由器之外的主机，因此该网关路由器的 L3 和 L2 地址就是主机所需的全部。

Given a network like the one in Figure 28-20(a), the worst-case scenario is where the /25 subnet contains a full 126 hosts (the number that fits in 7 bits, minus the network and broadcast addresses), and each host needs to resolve the address of every other host. This would lead to 126 * (126-1) different solicitation requests. However, a worst-case scenario like that one is far from average, because a host usually has to access only a few local machines, such as servers. Most of a host's traffic goes to hosts beyond the router, so the L3 and L2 address of this gateway router is all the host needs.
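The worst-case arithmetic is easy to verify; the short sketch below simply restates the computation for a /25 subnet (7 host bits, minus the network and broadcast addresses), with each host soliciting every other host.

```python
host_bits = 7                          # a /25 netmask leaves 7 host bits
n_hosts = 2**host_bits - 2             # minus network and broadcast addresses
worst_case = n_hosts * (n_hosts - 1)   # each host solicits every other host
print(n_hosts, worst_case)
```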


图 28-20。代理与路由器

Figure 28-20. Proxy versus router

如果我们保持相同的网络拓扑,但将 /24 网络掩码更改为 /0 网络掩码,则最坏情况会发生爆炸,并且平均情况开始接近最坏情况。每当一台主机想要与另一台主机通信时,无论后者是远程的还是本地的,总会有一个单独的请求。主机必须为每个主机发出单独的请求,而不是为默认网关(可用于到达路由器之外的任何主机)发出单个请求,因为它不了解路由器。

If we keep the same network topology but change the /24 netmask to a /0 netmask, the worst-case scenario explodes—and the average scenario starts to approach the worst-case scenario. Any time a host wants to communicate with another host, regardless of whether the latter is remote or local, there will always be a separate solicitation. Instead of making a single solicitation for the default gateway, which can be used to reach any host beyond the router, a host must make a separate solicitation for each host, because it has no knowledge about the router.

总而言之,使用代理 ARP 服务器作为路由器可以简化子网上主机的配置,并且由于没有路由而需要主机上较轻的 TCP/IP 堆栈。但由于请求数量较多,网络和代理 CPU 上的负载可能会增长得相当高。

To summarize, the use of a proxy ARP server as a router can simplify the configuration of the hosts on a subnet, and require a lighter TCP/IP stack on the hosts because there is no routing. But the load on the network and on the proxy's CPU can grow quite high, due to the higher number of solicitations.

例子

Examples

我们以 图28-21的拓扑为例。

Let's take the topology of Figure 28-21 as an example.


图 28-21。在主机 RT 上配置代理 ARP 的网络示例

Figure 28-21. Example of network with proxy ARP configured on host RT

让我们做出以下假设:

Let's make the following hypotheses:

  • 所有主机均使用以太网卡。

  • All the hosts use Ethernet cards.

  • LAN1 和 LAN2 的所有主机都配置了网络掩码 255.255.255.0 (/24)。它们的路由表中没有任何路由,也没有配置默认网关。换句话说,LAN1 和 LAN2 中的主机只能与同一逻辑子网内的其他主机进行通信。

  • All the hosts of LAN1 and LAN2 are configured with a netmask of 255.255.255.0 (/24). They do not have any routes in their routing tables, nor do they have a default gateway configured. In other words, hosts in LAN1 and LAN2 can communicate only with other hosts within their same logical subnet.

  • 所有邻居缓存都是空的,这意味着没有一台主机知道任何其他主机的任何链路层地址。

  • All neighbor caches are empty, which means that no one host knows any link layer address of any other host.

  • 桥接在任何地方都被禁用。这不包括第 26 章图 26-10的右上角情况。如果这个假设的含义不清楚,您应该阅读第四部分

  • Bridging is disabled everywhere. This excludes the top-right case of Figure 26-10 in Chapter 26. If the implications of this hypothesis are not clear, you should read Part IV.

请注意,即使 LAN1 和 LAN2 上的两台主机都配置为属于同一逻辑子网(网络 10.0.0.0/24,网络掩码 255.255.255.0),它们实际上属于不同的 LAN。这意味着就配置而言,它们共享同一子网,并且无需任何路由器的帮助即可相互通信。然而,通过查看网络拓扑,很明显,如果没有 RT 的帮助,他们就无法做到这一点。

Note that even if both hosts on LAN1 and LAN2 have been configured as belonging to the same logical subnet (network 10.0.0.0/24, netmask 255.255.255.0), they actually belong to different LANs. This means that as far as the configuration is concerned, they share the same subnet and can communicate with each other without the help of any router. However, by looking at the network topology, it is clear that they cannot do that without the help of RT.

为了让 RT 给 LAN1 和 LAN2 的主机一种它们在同一子网中的错觉,RT 需要有一些额外的知识:它想要合并的 LAN 的真实网络掩码,即 255.255.255.128 或 /25。如果 RT 不是代理 ARP 服务器,则 RT 或 LAN1 和 LAN2 中的主机将被视为配置错误。[ * ]

For RT to give the hosts of LAN1 and LAN2 the illusion that they are on the same subnet, RT needs to have some extra knowledge: the real netmask of the LANs it wants to merge, which is 255.255.255.128 or /25. If RT was not a proxy ARP server, RT or the hosts in LAN1 and LAN2 would be considered misconfigured.[*]

为了使 RT 能够让 LAN1 和 LAN2 中的主机透明地通信，RT 需要更多地了解网络拓扑，更确切地说，它需要知道谁在哪一边。请注意，RT 把在一侧收到的任何内容转发到另一侧这种简单方案与代理无关；第四部分介绍了那种情况。如果 RT 想要代表（即代替）LAN1 的主机应答来自 LAN2 的请求，它需要知道谁在哪一边。例如，RT 不应回复在 LAN1 上生成并发往 LAN1 内其他主机的请求（因为 LAN1 的主机已属于同一子网 10.0.0.0/24）。由于其 eth0 和 eth1 网卡上配置了正确的网络掩码，RT 知道：

For RT to make hosts in LAN1 and LAN2 communicate transparently, RT needs to have some more knowledge about the network topology; more exactly, it needs to know who is where. Note that the simple solution where RT forwards on one side whatever it receives on the other has nothing to do with proxying; Part IV covered that scenario. If RT wants to represent (e.g., reply in place of) the hosts of LAN1 to the requests of/from LAN2, it needs to know who is on which side. For instance, RT should not reply to requests generated on LAN1 and addressed to other hosts within LAN1 (because the hosts of LAN1 already belong to the same subnet, 10.0.0.0/24). Thanks to the right netmasks on its eth0 and eth1 NICs, RT knows that:

  • eth0上,有地址范围为10.0.0.1到10.0.0.126的主机(10.0.0.127是广播,10.0.0.0是网络)。

  • On eth0, there are hosts with addresses ranging from 10.0.0.1 to 10.0.0.126 (10.0.0.127 is the broadcast and 10.0.0.0 is the network).

  • eth1上,有地址范围为10.0.0.129到10.0.0.254的主机(10.0.0.255是广播,10.0.0.128是网络)。

  • On eth1, there are hosts with addresses ranging from 10.0.0.129 to 10.0.0.254 (10.0.0.255 is the broadcast and 10.0.0.128 is the network).
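RT's "extra knowledge" can be illustrated with Python's ipaddress module: the hosts believe they all share 10.0.0.0/24, while RT's /25 netmasks tell it which physical segment each address is on. (The host names follow Figure 28-21 and are illustrative.)

```python
import ipaddress

lan1 = ipaddress.ip_network("10.0.0.0/25")      # segment behind eth0
lan2 = ipaddress.ip_network("10.0.0.128/25")    # segment behind eth1
believed = ipaddress.ip_network("10.0.0.0/24")  # what the hosts think they share

host_d = ipaddress.ip_address("10.0.0.2")       # physically on LAN1
host_a = ipaddress.ip_address("10.0.0.130")     # physically on LAN2

# Both hosts consider each other on-link...
print(host_d in believed, host_a in believed)   # True True
# ...but RT sees them on different segments, so it must proxy between them.
print(host_d in lan1, host_a in lan1)           # True False
```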

让我提醒您,需要路由器将数据包从一个子网转发到另一个子网(即发送者和接收者不在同一子网中)。请注意,如果 RT 上未启用代理 ARP,则路由器 RT 的连接 LAN1 和 LAN2 的两个 NIC 将配置错误。

Let me remind you that a router is needed to forward packets from one subnet to another one (i.e., sender and receiver are not in the same subnet). Note that the two NICs of router RT that go to LAN1 and LAN2 would be misconfigured if proxy ARP was not enabled on RT.

现在我们来分析几个常见的情况。您可以参考第 26 章中的图 26-9图 26-10了解预期行为:

Let's now analyze a few common cases. You can refer to Figure 26-9 and Figure 26-10 in Chapter 26 for the expected behaviors:

  • (a) 从LAN1到LAN1(例如从主机D到主机E)

    由于主机 D（10.0.0.2）与主机 E 在同一子网（10.0.0.0/24）中，它可以为 IP 地址 10.0.0.3 发送请求（ARPOP_REQUEST）。LAN1 中的所有主机都会收到该请求，但只有主机 E 会回复主机 D 并指明其 L2 地址。请注意，即使在 eth0 上启用了代理，RT 也不会回复。原因是 RT 在 eth0（LAN1）上收到了该请求，并且由于它知道 10.0.0.3 位于发送者所在的同一子网内，因此不需要拦截该请求：主机 E 就位于请求所来自的网络中，因此可以自行应答。

  • (a) From LAN1 to LAN1 (e.g., from Host D to Host E)

    Since Host D (10.0.0.2) is in the same subnet (10.0.0.0/24) as Host E, it can send a solicitation request (ARPOP_REQUEST) for the IP address 10.0.0.3. All of the hosts in LAN1 will receive that request, but only Host E will reply to Host D specifying its L2 address. Note that RT would not reply even if proxying was enabled on eth0. The reason is that RT received the solicitation on eth0 (LAN1), and since it knows that 10.0.0.3 is located within the same subnet of the sender, it does not need to intercept the request: Host E resides in the network the solicitation comes from and therefore it can answer by itself.

  • (b) 从 LAN1 到 LAN1 中的非法 IP 地址(例如从主机 D 到 10.0.0.128)

    从主机 D 的角度来看，10.0.0.128 是一个有效的主机地址；从 RT 的角度来看它不是（它是一个网络地址）。没有人会回复。无论 RT 是否是代理，情况都是如此。

    这里棘手的部分是,即使 LAN1 和 LAN2 的主机配置了 10.0.0.0/24 网络掩码,它们也已根据 RT 的配置在物理上划分在两侧。RT 不会回复,因为它将 10.0.0.128 识别为网络地址。

  • (b) From LAN1 to an illegal IP address in LAN1 (e.g., from Host D to 10.0.0.128)

    From Host D's perspective, 10.0.0.128 is a valid host address; from RT's perspective it is not (it is a network address). No one is going to reply. This is true regardless of whether RT is a proxy.

    The tricky part here is that even if the hosts of LAN1 and LAN2 are configured with a 10.0.0.0/24 netmask, they have been physically divided on the two sides accordingly to RT's configuration. RT does not reply because it recognizes 10.0.0.128 as a network address.

  • (c) 从LAN1到LAN2(例如从主机D到主机A)

    由于地址为 10.0.0.130 的主机位于另一个 LAN 上,因此该主机将无法接收请求并回复。但是,由于 RT 配置为在eth0上启用了代理,因此它将回复其eth0接口的地址。这意味着,当主机 D 向主机 A 发送数据时,它实际上会将数据发送给 RT,RT 将简单地将数据转发给主机 A。如果主机 A 请求主机 D 的地址,则会发生相反的情况。

  • (c) From LAN1 to LAN2 (e.g., from Host D to Host A)

    Since the host with address 10.0.0.130 is on another LAN, the host would not be able to receive the request and reply. However, since RT is configured with proxy enabled on eth0, it will reply with the address of its eth0 interface. This means that when Host D sends data to Host A, it will actually send it to RT, which will simply forward it to Host A. The opposite would have happened if Host A had asked for Host D's address.

  • (d) 从 LAN1 到 LAN3(例如从主机 D 到主机 F)

    由于主机 F 与主机 D 不在同一子网(10.0.1.2 不在 10.0.0.0/24 中),并且主机 D 中没有定义到达 LAN3(10.0.1.0/24)的路由,因此内核的 IP 层主机D会回复一条消息说没有路由可以到达主机F,并且主机D甚至不会生成请求请求。[ * ]

  • (d) From LAN1 to LAN3 (e.g., From Host D to Host F)

    Since Host F is not on the same subnet as Host D (10.0.1.2 is not in 10.0.0.0/24) and no routes are defined in Host D to reach LAN3 (10.0.1.0/24), the IP layer of the kernel in Host D would reply with a message saying that no route is available to reach Host F, and Host D would not even generate a solicitation request.[*]
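The four cases above can be condensed into a small hedged helper; the subnets follow Figure 28-21, and the decision labels ("proxy", "silent", "no_arp") are mine, not kernel terms.

```python
import ipaddress

lan1 = ipaddress.ip_network("10.0.0.0/25")
lan2 = ipaddress.ip_network("10.0.0.128/25")
believed = ipaddress.ip_network("10.0.0.0/24")  # the hosts' configured subnet

def rt_behavior(sender, target):
    """Return 'no_arp', 'silent', or 'proxy' for a request RT would hear."""
    s, t = ipaddress.ip_address(sender), ipaddress.ip_address(target)
    if t not in believed:
        return "no_arp"   # case (d): sender has no route, never solicits
    sender_side = lan1 if s in lan1 else lan2
    other_side = lan2 if sender_side is lan1 else lan1
    if t in sender_side:
        return "silent"   # case (a): the target can answer by itself
    if t in (other_side.network_address, other_side.broadcast_address):
        return "silent"   # case (b): not a valid host address to RT
    return "proxy"        # case (c): RT replies with its own L2 address

print(rt_behavior("10.0.0.2", "10.0.0.3"))     # case (a)
print(rt_behavior("10.0.0.2", "10.0.0.128"))   # case (b)
print(rt_behavior("10.0.0.2", "10.0.0.130"))   # case (c)
print(rt_behavior("10.0.0.2", "10.0.1.2"))     # case (d)
```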

外部事件

External Events

当特殊情况出现时,ARP 可以接收并生成通知。第 27 章中的“相邻协议和 L3 传输功能之间的交互”部分概述了相邻协议如何与内核的其余部分交互。在这里我们将具体了解 ARP 如何处理这些通知。

ARP can both receive and generate notifications when special conditions come into being. The section "Interaction Between Neighboring Protocols and L3 Transmission Functions" in Chapter 27 gives an overview of how neighboring protocols interact with the rest of the kernel. Here we will see in particular how ARP takes care of these notifications.

收到的事件

Received Events

我们在“ARP 协议初始化”一节中看到，ARP 向内核注册以获得设备事件通知，而 arp_netdev_event 是处理这些事件的处理程序。在该函数接收的各种事件类型中，ARP 只对 NETDEV_CHANGEADDR 感兴趣，它是在设备的 L2 地址发生更改（例如，通过手动配置）时生成的。处理用户空间更改设备链路层地址的请求、并因此生成 NETDEV_CHANGEADDR 通知的内核例程是 do_setlink，定义在 net/core/rtnetlink.c 中。

We saw in the section "ARP Protocol Initialization" that ARP registers with the kernel for the notification of device events and that arp_netdev_event is the handler that takes care of those events. Among the various event types that the function receives, ARP is interested only in NETDEV_CHANGEADDR, which is generated when the L2 address of a device is changed (e.g., via manual configuration). The kernel routine that processes the user-space request to change a device link layer address, and that therefore generates the NETDEV_CHANGEADDR notification, is do_setlink, defined in net/core/rtnetlink.c.

static int
arp_netdev_event(struct notifier_block *this, unsigned long event, void *ptr)
{
    struct net_device *dev = ptr;

    switch (event) {
    case NETDEV_CHANGEADDR:
        neigh_changeaddr(&arp_tbl, dev);
        rt_cache_flush(0);
        break;
    default:
        break;
    }

    return NOTIFY_DONE;
}

neigh_changeaddr 在第 27 章“相邻层接收的事件”一节中进行了描述。

neigh_changeaddr is described in the section "Events Received by the Neighboring Layer" in Chapter 27.

rt_cache_flush刷新 IPv4 路由缓存,以便强制 IP 层开始使用新的 L2 地址。此函数不会有选择地删除与生成通知的设备关联的条目,而只是删除缓存中的所有内容。第 33 章详细介绍了输入参数的含义和路由缓存的一般情况。

rt_cache_flush flushes the IPv4 routing cache so that the IP layer is forced to start using the new L2 address. This function does not selectively delete the entries associated with the device that generated the notification, but simply removes everything in the cache. Chapter 33 contains details about the meaning of the input parameter and the routing cache in general.

生成的事件

Generated Events

虚函数 error_report 是 neigh_ops 结构体的一部分，在第 27 章“相邻层生成的事件”一节中提到过。在 ARP 中，该功能由 arp_error_report 实现。当 ARP 事务失败时，ARP 子系统会调用该例程。它的两个主要任务是：

The error_report virtual function, which is part of the neigh_ops structure, was mentioned in the section "Events Generated by the Neighboring Layer" of Chapter 27. In ARP, this function is carried out by arp_error_report. The ARP subsystem invokes the routine when an ARP transaction fails. Its two main tasks are:

  • 从路由表缓存中删除与不可达邻居关联的条目。[ * ]

  • Remove the entry associated with the unreachable neighbor from the routing table cache.[*]

  • 通过 ICMP UNREACHABLE 消息向发送方通知无法到达的邻居。

  • Notify the sender about the unreachable neighbor by means of an ICMP UNREACHABLE message.

局域网唤醒事件

Wake-on-LAN Events

一些复杂的 NIC 支持一种称为 LAN 唤醒（WOL）的功能。

Some sophisticated NICs support a feature called Wake-on-LAN (WOL).

第 6 章中简要介绍的 WOL是一项功能,允许 NIC 在收到特定类型的帧时唤醒处于待机模式的系统。引起唤醒的各种类型的帧包括 ARP 数据包。该功能是在硬件级别实现的,因为处于待机模式的系统没有在 CPU 中运行的可以处理传入数据包的设备驱动程序。启用 WOL 的 NIC 需要有自己的电源才能扫描这些特殊帧。我不会详细介绍此功能,因为它完全由 NIC 驱动程序处理,而不是由 ARP 模块处理。有关详细信息,请浏览关键字的代码 WAKE_ARP

WOL, briefly introduced in Chapter 6 is a feature that allows an NIC to wake up a system in standby mode when it receives a specific type of frame. Among the various types of frames that cause wake-ups are ARP packets. The feature is implemented at the hardware level because a system in standby mode does not have a device driver running in the CPU that can process incoming packets. WOL-enabled NICs need to have their own source of power to be able to scan for those special frames. I will not go into detail on this feature because it is handled entirely by the NIC's drivers, not by the ARP module. For details, browse the code for the WAKE_ARP keyword.

ARPD

ARPD

一个网段上的邻居数量可以从几个到数千个不等。因此，在大型网络上，neighbour 数据结构所需的内存可能会变得相当大并影响系统性能。增加 neigh_table 结构体中 gc_thresh1/gc_thresh2/gc_thresh3 配置参数的值只是改变了可以创建的最大条目数，并不能解决过度消耗有限内核内存的性能问题。

The number of neighbors on a network segment can range from a few to many thousands. On large networks, the memory required by neighbour data structures can therefore grow quite big and affect system performance. Increasing the values of the gc_thresh1/gc_thresh2/gc_thresh3 configuration parameters in the neigh_table structure simply changes the maximum number of entries that can be created, but it does not solve the performance problem of over-consumption of limited kernel memory.

arpd是一个用户空间守护进程,可以通过保留自己的(更大的)缓存来卸载内核的工作。ARP 的用户空间实现不可能像内核实现一样快,但这种差异在大多数情况下是可以接受的。

arpd is a user-space daemon that can offload work from the kernel by keeping its own (bigger) cache. A user-space implementation of ARP cannot be as fast as a kernel implementation, but the difference is acceptable in most cases.

要使用arpd,必须编译支持 ARPD 功能的内核。内核文档称 ARPD 为实验性功能,但实际上它已经存在很长时间了。

To use arpd, a kernel has to be compiled with support for the ARPD feature. The kernel documentation calls ARPD an experimental feature, but it has actually been around for a long time.

目前有两个arpd守护程序可供下载。一个是旧的并且不能正常工作,另一个是 IPROUTE2 包的一部分并且可以工作。我将在本节中参考第二个。

Two arpd daemons are currently available for download. One is old and does not work properly, and the other is part of the IPROUTE2 package and does work. I will refer to the second one in this section.

arpd守护进程负责拦截来自其他系统的 ARP 请求并维护自己的数据库来代替内核缓存。本章我们不会过多谈论守护进程的内部,但我们将重点讨论守护进程与内核之间的交互。在arpd维护自己与网络的关系的同时,内核还可以继续处理ARP请求,并负责将 内核知道的事件通知给arpd 。它们通过 Netlink 套接字进行通信,2.6 内核默认支持该套接字。

The arpd daemon is responsible for intercepting ARP requests from other systems and maintaining its own database in lieu of a kernel cache. We won't say much about the internals of the daemon in this chapter, but we will focus on the interaction between the daemon and the kernel. While arpd maintains its own relationship with the network, the kernel can also continue to handle ARP requests, and is responsible for notifying arpd about events the kernel knows about. They communicate via a Netlink socket, which is supported by default in the 2.6 kernel.

图 28-22给出了相邻子系统、ARP 和arpd之间交互的大图。本质上,相邻子系统向守护进程发送通知,而守护进程则侦听它们。接下来的两节将更详细地介绍这种交互。

Figure 28-22 gives the big picture of the interaction between the neighboring subsystem, ARP, and arpd. Essentially, the neighboring subsystem sends notifications to the daemon and the daemon listens for them. The next two sections go into more detail on this interaction.


图 28-22。ARP 和 arpd 守护进程之间的交互

Figure 28-22. Interaction between ARP and arpd daemon

内核端

Kernel Side

当 ARPD 启用时,相邻子系统将消息发送到用户空间守护程序。在这里,我们回顾一下用于发送这些消息的例程以及调用例程的条件:

When ARPD is enabled, the neighboring subsystem sends messages to the user-space daemon. Here we review the routines used to send those messages and the conditions under which the routines are invoked:

neigh_app_ns
neigh_app_ns

当允许内核发送的请求（探测）数量耗尽而用户空间生成的请求数量尚未耗尽时，会从协议的请求函数（arp_solicit）中调用此函数。使用 arpd 的规则是：内核必须先用完对某个邻居的所有探测，然后才能调用守护进程。然而，没有什么可以阻止管理员将 ARPD 配置为让内核根本不生成任何探测，而是立即调用 arpd。

neigh_app_ns 生成类型为 RTM_GETNEIGH 的消息。

This is called from the protocol's solicit function (arp_solicit) when the number of solicitations (probes) the kernel is allowed to send is exhausted and the number of user-space-generated solicitations is not. The rule for using arpd is that the kernel must use up all the probes for a neighbor before invoking the daemon. However, nothing prevents an administrator from configuring ARPD so that the kernel generates no probes at all, and invokes arpd right away.

neigh_app_ns generates messages of type RTM_GETNEIGH.

neigh_app_notify
neigh_app_notify

它用于向 ARPD 发送两种通知：

  • 某个 neighbour 条目已移至 NUD_FAILED 状态，很快就会被垃圾收集器删除。在这种情况下，状态的改变和对 neigh_app_notify 的调用由 neigh_periodic_timer（第 27 章中描述）处理。

  • 邻居的状态已从有效状态（派生状态 NUD_VALID）变为无效状态，或者邻居的 L2 地址已更改。在这种情况下，这些状态变化和对 neigh_app_notify 的调用由 neigh_update 处理。

neigh_app_notify 生成类型为 RTM_NEWNEIGH 的消息。

This is used to send ARPD two kinds of notifications:

  • A neighbour entry has been moved to the NUD_FAILED state and will soon be deleted by the garbage collector. This change of state and the call to neigh_app_notify are handled in this case by neigh_periodic_timer (described in Chapter 27).

  • The state of a neighbor has changed from a valid one (the derived state NUD_VALID) to an invalid one, or the neighbor's L2 address has changed. These changes of state and the calls to neigh_app_notify are handled in this case by neigh_update.

neigh_app_notify generates messages of type RTM_NEWNEIGH.

用户空间侧

User-Space Side

在上一节中，我们看到了内核何时向 arpd 发送通知。现在我们来看看 arpd 如何处理它们。下面是守护进程（main 函数）的骨架：

In the previous section we saw when the kernel sends notifications to arpd. Now we'll see how arpd handles them. Here's the skeleton of the daemon (the main function):

1. 解析命令行选项
2. 打开数据库
3. 如果存在选项,则从文件加载数据库
    (3.1)打开socket用于ARP包的接收和发送
    (3.2) 使用内核打开套接字以进行 ARPD 通知
4.无限循环
    (4.1) 轮询两个套接字
    (4.2) 如果套接字(1)上出现事件,则处理输入的ARP数据包
    (4.3) 如果套接字(2)上出现事件,则处理输入内核消息
1. Parse command-line options
2. Open database
3. Load database from file if option present
    (3.1) Open socket for reception and transmission of ARP packets
    (3.2) Open socket with kernel for ARPD notifications
4. Infinite loop
    (4.1) Poll the two sockets
    (4.2) If events appear on socket (1), process input ARP packet
    (4.3) If events appear on socket (2), process input kernel message

这种行为的简化模型如图 28-23所示。(图28-23(a)代表4.2,图28-23(b)代表4.3。)它应该清楚地显示与上一节中描述的内核行为的对应关系。

A simplified model of this behavior is shown in Figure 28-23. (Figure 28-23(a) represents 4.2, and Figure 28-23(b) represents 4.3.) It should clearly show a correspondence to the kernel behavior described in the previous section.

该守护进程接受一些命令行选项来调整其行为。例如,管理员可以指定:

The daemon accepts a few command-line options to tune its behavior. For instance, the administrator can specify:

  • 放弃之前要发送多少个探测

  • How many probes to send before giving up

  • 内核是否也应该生成探针,或者只是守护进程

  • Whether the kernel should generate probes too, or just the daemon

  • 将文件中的条目上传到缓存中

  • Uploading entries into the cache from a file

当前的arpd守护程序使用通用 Berkeley DB 数据库实现其 ARP 缓存,这就是为什么当管理员安装 IPROUTE2 软件包时,它包含对 Berkeley DB 软件包的依赖项。

The current arpd daemon implements its ARP cache using a generic Berkeley DB Database, which is the reason why, when an administrator installs the IPROUTE2 package, it includes a dependency on the Berkeley DB package.

arpd和内核 ARP 子系统之间的一个区别值得一提:与内核 ARP 缓存不同, arpd缓存存储负结果。当尝试解析地址失败时,守护程序会将该信息存储在其缓存中,并且在一定时间内不会重试解析。

One difference between arpd and the kernel's ARP subsystem is worth mentioning: unlike the kernel ARP cache, the arpd cache stores negative results. When an attempt to resolve an address fails, the daemon stores that information in its cache and does not retry the resolution for a certain amount of time.

反向地址解析协议 (RARP)

Reverse Address Resolution Protocol (RARP)

RARP 是一种旧协议，可用于自动配置动态主机。它的功能先被 bootp 取代，后来又被 DHCP 取代。尽管 RARP 与 ARP 的用途不同，但 RARP 也使用 ARP 数据包（其操作代码不同于 ARPOP_REQUEST 和 ARPOP_REPLY），并共享同一个传输例程 arp_send。Linux 内核默认不包含 RARP；它必须在编译时显式加入。

RARP is an old protocol that can be used to autoconfigure a dynamic host. Its function was replaced by bootp and then DHCP. Although RARP has a different purpose from ARP, RARP also uses ARP packets (with operation codes different from ARPOP_REQUEST and ARPOP_REPLY) and shares the same transmit routine, arp_send. RARP is not included by default in the Linux kernel; it has to be added explicitly at compilation time.

(a) 处理ARP数据包；(b) 处理内核消息

图 28-23。(a) 处理ARP数据包；(b) 处理内核消息

Figure 28-23. (a) Processing ARP packets; (b) processing kernel messages

ND (IPv6) 相对于 ARP (IPv4) 的改进

Improvements in ND (IPv6) over ARP (IPv4)

正如 第 26 章中所解释的,IPv6 邻居协议 ND 的设计与 ARP 非常不同。以下是 ND 的一些改进:

As explained in Chapter 26, the IPv6 neighboring protocol ND has a very different design from ARP. Here are some of the improvements in ND:

  • ND是ICMPv6提供的功能,ICMPv6是一个强大的协议,涵盖了ARP、ICMPv4等功能。特别是,正如我们在第 26 章的“邻居协议” 部分中看到的,将 ND 放入 ICMP 中可以使 ND 能够利用所提供的任何 L3 功能,尤其是加密。

  • ND is a function provided by ICMPv6, a powerful protocol that covers the functionalities of ARP, ICMPv4, and more. In particular, as we saw in the section "Neighboring Protocols" in Chapter 26, putting ND into ICMP allows ND to take advantage of any L3 feature provided, notably encryption.

  • ND 使用多播请求而不是广播。要使用的多播地址是从要请求的目标地址派生的,这意味着只有那些注册给定 IP 多播地址的主机才会收到关联的请求请求。在大型网络中,这可以大大减少主机接收和丢弃的请求数量,因为它们不是目标。

  • ND uses multicast solicitations rather than broadcasts. The multicast address to use is derived from the target address to solicit, which means that only those hosts that register for a given IP multicast address receive the associated solicitation requests. In a big network, this can drastically reduce the number of solicitations that hosts receive and discard because they are not the target.

  • ND 使用邻居不可达检测算法来检测失效邻居。这并不是每个 ARP 实现的一部分,但正如我们在第 27 章中看到的,Linux 也为 ARP 实现了它。

  • ND uses a neighbor unreachability detection algorithm to detect dead neighbors. This is not part of every ARP implementation, but as we saw in Chapter 27, Linux implements it for ARP as well.




[ * ]arphdr结构不包含 ARP 帧最后四个字段(地址)的占位符;通过简单地读取 Oper 字段末尾即可提取这些内容,这要归功于 HS 和 PS 字段。

[*] The arphdr structure does not contain placeholders for the last four fields of the ARP frame (the addresses); those are extracted by simply reading past the end of the Oper field, which is made possible thanks to the HS and PS fields.

[ * ]为方便起见,图中的 MAC 地址已被截断。例如,00:...:03 代表 00:00:00:00:00:03。我使用了像这样的简单 MAC 地址来简化图。

[*] The MAC addresses in the figure are truncated for convenience. For example, 00:...:03 stands for 00:00:00:00:00:03. I used simple MAC addresses like that one to simplify the figure.

[ * ] 使用"可调 ARP 选项"一节中描述的选项，您可以使 Linux 的行为就像 IP 地址属于接口一样。有关此设计的有趣讨论（包括其优点和缺点），可以参考 netdev 邮件列表上（相当长的）线程"ARP responds on all devices"，该线程存档于 http://oss.sgi.com/archives/netdev。

[*] Using the options described in the section "Tunable ARP Options," you can make Linux behave as if IP addresses belonged to the interfaces. For an interesting discussion of this design, including its advantages and disadvantages, you can refer to the (pretty long) thread "ARP responds on all devices" on the netdev mailing list, which is archived at http://oss.sgi.com/archives/netdev.

[ * ] 127.x.x.x 地址是一个例外；对它们的 ARP 请求永远不会被处理。

[*] 127.x.x.x addresses are an exception; ARP requests for them are never handled.

[ * ]除非内核支持多路径缓存。该功能在第 33 章中进行了描述。

[*] Unless the kernel comes with support for multipath caching. That feature is described in Chapter 33.

[ * ]这是 Julian Anastasov 和 Alexey Kuznetsov 在http://www.ssi.bg/~ja/medium_id.txt提供的示例的扩展版本。该文档还描述了此功能有用的常见场景。

[*] This is an extended version of the example provided by Julian Anastasov and Alexey Kuznetsov at http://www.ssi.bg/~ja/medium_id.txt. The document also describes a common scenario where this feature can be useful.

[ * ]这与 IP 数据包分段无关。详细内容参见第 2 章

[*] This has nothing to do with IP packet fragmentation. Details are in Chapter 2.

[ * ] LOOPBACK 识别 127.x.x.x 地址，MULTICAST 识别 224.x.x.x（D 类）地址。

[*] LOOPBACK recognizes the addresses 127.x.x.x and MULTICAST recognizes the addresses 224.x.x.x (class D).

[ * ] Linux 支持的所有 NAT(SNAT、DNAT、伪装等)均由 Netfilter 实现。因为本书不涉及 Netfilter 的内部结构,所以我也没有在书中讨论 NAT。有关 NAT 风格之间差异的讨论,您可以参考 Netfilter 项目的主页http://www.netfilter.org

[*] All flavors of NAT supported by Linux—SNAT, DNAT, Masquerading, etc.—are implemented by Netfilter. Because this book does not cover the Netfilter internals, I have not included a discussion on NAT in the book either. For a discussion of the differences between the NAT flavors, you can refer to the Netfilter project's home page, http://www.netfilter.org.

[ * ]如果我们排除特殊功能(例如桥接)的使用,则该陈述是正确的。

[*] The statement is correct if we exclude the use of special features, like bridging.

[ * ] 请注意，即使主机 F 物理上位于 LAN1 中，但仍保留其 LAN3 地址（这在大多数情况下属于配置错误，如第 26 章中的图 26-1(c) 所示），上述结论也成立。

[*] Note that it would be true even if Host F were physically in LAN1 while still keeping its address of LAN3 (which would, in most cases, be a misconfiguration, as shown in Figure 26-1(c) in Chapter 26).

[ * ]准确地说,该条目是从协议无关的缓存中删除的,这将在第33章中详细介绍。

[*] To be exact, the entry is removed from the protocol-independent cache, which is covered in detail in Chapter 33.

第 29 章相邻子系统:其他主题

Chapter 29. Neighboring Subsystem: Miscellaneous Topics

通过本章，我们结束了本书关于邻居协议的部分。本章展示了用于配置邻居协议的用户空间命令如何与内核交互，以易于阅读的表格总结了前三章中介绍的变量和函数，最后详细描述了邻居子系统所使用的主要数据结构。

With this chapter, we conclude the part of this book on the neighboring protocol. The chapter shows how the user-space commands used to configure neighboring protocols interact with the kernel, summarizes the variables and functions introduced in the previous three chapters in easy-to-read tables, and concludes with a detailed description of the main data structures used by the neighboring subsystem.

邻居系统管理

System Administration of Neighbors

可以使用两个用户空间工具添加、删除和修改邻居条目:

Neighbor entries can be added, removed, and modified with two user-space tools:

arp
arp

这是较旧的工具。它是 net-tools 软件包的一部分，其中还包括其他常用命令，例如 ifconfig、route、netstat 等。正如其名称所示，arp 仅处理 IPv4 邻居协议 ARP 的条目。与其同伴一样，arp 使用 ioctl 调用与内核通信。

This is the older tool. It is part of the net-tools package, which includes other common commands such as ifconfig, route, netstat, etc. arp handles entries only for the IPv4 neighboring protocol ARP, as the name indicates. Like its companions, arp uses ioctl calls to communicate with the kernel.

ip
ip

这被认为是当前的工具。ip命令 是 IPROUTE2 软件包的一部分,用于配置各种网络子系统(路由、流量控制等)。它可用于配置任何相邻协议,并使用 Netlink 套接字与内核进行通信。

This is considered the current tool. The ip command is part of the IPROUTE2 package and is used to configure a wide range of networking subsystems (routing, traffic control, etc.). It can be used to configure any neighboring protocol, and it talks to the kernel using the Netlink socket.

这两个工具还可用于配置基于目标的代理。

Both tools can also be used to configure destination-based proxying.

本章不会详细介绍这些命令的语法、功能或实现，但值得了解当命令操作 neighbour 条目时内核端执行了什么。

This chapter does not go into detail on the commands' syntax, features, or implementation, but it is worth knowing what is executed on the kernel side when the commands manipulate a neighbour entry.

接下来的三节概述了如何将配置命令传播到内核。对于 IPROUTE2,我还将简要展示用户空间代码是如何组织的。

The next three sections give you an overview of how configuration commands are propagated to the kernel. In the case of IPROUTE2, I'll also show briefly how the user-space code is organized.

常用例程

Common Routines

尽管 ip 和 arp 使用不同的机制与内核通信，并且（正如我们将在接下来的两节中看到的）使用不同的内核处理程序来处理配置命令，但最终它们实际上是通过同一组例程与邻居层通信的：

Even though ip and arp use different mechanisms to talk to the kernel and therefore, as we will see in the next two sections, use different kernel handlers to process the configuration commands, in the end they actually talk to the neighboring layer via the same set of routines:

查找例程
Lookup routines

在对现有条目进行更改或添加新条目之前,内核需要在缓存中进行查找。这些查找是使用第 27 章“缓存”部分中描述的函数完成的。

Before applying a change to an existing entry or adding a new one, the kernel needs to do a lookup in the cache. These lookups are done using the functions described in the section "Caching" in Chapter 27.

neigh_update
neigh_update

neigh_update是一个通用例程,可以根据其输入参数完成各种不同的操作。该功能在第 27 章“更新邻居信息:neigh_update ”一节中描述。

neigh_update is a generic routine that can accomplish a variety of different operations depending on its input parameters. The function is described in the section "Updating a Neighbor's Information: neigh_update" in Chapter 27.

pneigh_update
pneigh_update

基于目标的代理使用 pneigh_update 而不是 neigh_update。请参阅第 27 章中的"充当代理"一节。

pneigh_update is used instead of neigh_update by destination proxying. See the section "Acting As a Proxy" in Chapter 27.

必要时，查找例程使用第 27 章中描述的 neigh_add 和 neigh_destroy 例程来创建或删除 neighbour 条目。

The lookup routines, when necessary, create or delete neighbour entries with the neigh_add and neigh_destroy routines described in Chapter 27.

图 29-1总结了本节和上一节中描述的关系。

Figure 29-1 summarizes the relationships described in this section and the previous one.

用户空间和内核之间的 arp 和 ip neighbor 接口

图 29-1。用户空间和内核之间的 arp 和 ip neighbor 接口

Figure 29-1. Interface between the user space and the kernel for arp and ip neighbour

新一代工具:IPROUTE2 的 ip 命令

New-Generation Tool: IPROUTE2's ip Command

ip 是一个通用命令，它取代了许多传统的 Unix 命令，例如 ifconfig、route 和 arp。ip 命令的第一个参数（address、route、neighbour 等）指示 ip 所作用的对象，从而决定它是执行 ifconfig、route 还是 arp 等命令的工作。就内核而言，ip 的对象决定了命令与哪个子系统交互。

ip is a generic command that replaces a number of traditional Unix commands such as ifconfig, route, and arp. The first argument of the ip command—address, route, neighbour, etc.—indicates the object that ip acts on, and thus whether it does the job of ifconfig, route, arp, and so on. In terms of the kernel, the ip object determines what subsystem the command interacts with.

用于配置邻居协议的命令是以ip neighbour开头的命令 。图29-2显示了IPROUTE2包中实现邻居协议配置的关键文件和函数。

The commands used to configure neighboring protocols are the ones that start with ip neighbour. Figure 29-2 shows the key files and functions of the IPROUTE2 package that implement the configuration of neighboring protocols.

IPROUTE2包的邻居文件和函数的结构

图 29-2。IPROUTE2包的邻居文件和函数的结构

Figure 29-2. Structure of IPROUTE2 package's neighbor files and functions

ip 的第二个参数是指示管理员想要对子系统执行什么操作的命令。表 29-1 总结了这些命令，并指出了内核代码中相应的操作、标志和处理程序。因此，向邻居子系统添加新条目的命令 ip neighbour add ... 会向内核发送一条 RTM_NEWNEIGH 命令，并同时设置 NLM_F_CREATE（如果条目不存在则创建）和 NLM_F_EXCL（如果条目已存在则保持不变）标志。该命令由内核处理程序 neigh_add 处理。

The second argument to ip is the command that indicates what the administrator wants to do to the subsystem. Table 29-1 summarizes the commands and indicates the corresponding operation, flags, and handler in the kernel code. Thus, the command ip neighbour add ..., which adds a new entry to the neighboring subsystem, sends the kernel a RTM_NEWNEIGH command with both the NLM_F_CREATE (create an entry if one doesn't exist) and NLM_F_EXCL (leave an entry alone if it does exist) flags set. The command is taken care of by the kernel handler neigh_add.

表 29-1。由 IPROUTE2 中的 do_ipneigh 和相关内核处理程序设置的参数

Table 29-1. Parameters set by do_ipneigh in IPROUTE2 and associated kernel handlers

命令行关键字              操作            标志                          内核处理程序
Command-line keyword     Operation       Flags                         Kernel handler
add                      RTM_NEWNEIGH    NLM_F_CREATE, NLM_F_EXCL      neigh_add
change, chg              RTM_NEWNEIGH    NLM_F_REPLACE                 neigh_add
replace                  RTM_NEWNEIGH    NLM_F_CREATE, NLM_F_REPLACE   neigh_add
delete                   RTM_DELNEIGH    None                          neigh_delete
show, list, lst          RTM_GETNEIGH    NLM_F                         neigh_dump_info
flush                    RTM_GETNEIGH    NLM_F                         neigh_dump_info

如果您查看所列出的内核函数之一（例如 neigh_add），借助表 29-1，您应该能够识别该函数各个部分的作用。当然，还需要对 Netlink 层有最低限度的了解，例如理解输入数据是如何解析的。Netlink 在第 3 章中介绍；但由于篇幅所限，未能涵盖其内部结构。

If you look at one of the kernel functions listed, such as neigh_add, thanks to Table 29-1 you should be able to identify what each part of the function does. Of course, a minimal knowledge of the Netlink layer is also required, for example, to understand how input data is parsed. Netlink is introduced in Chapter 3; however, its internals could not be covered for lack of space.

老一代工具:net-tools 的 arp 命令

Old-Generation Tool: net-tools's arp Command

那些更喜欢旧的 Unix 命令而不是 IPROUTE2 软件包的人，会在极少数需要手动操作主机 ARP 表的情况下使用 arp（当然，该命令无法用于其他邻居协议）。表 29-2 列出了主要的 arp 命令以及处理它们的内核处理程序。该表还显示了实现相同功能的 ip neigh 命令。请注意，没有任何 arp 命令对应于 ip neigh change 或 ip neigh replace（此时需要先执行删除，再执行添加）。

People who prefer the old Unix commands to the IPROUTE2 package use arp on the rare occasion that they need to manipulate a host's ARP tables by hand (the command has nothing to offer other neighboring protocols, of course). Table 29-2 lists the main arp commands along with the kernel handlers that process them. The table also shows the ip neigh command that achieves the same functionality. Note that no arp command corresponds to ip neigh change or ip neigh replace (instead, one would issue a delete followed by an add).

表29-2。arp命令、对应的ip命令以及调用的内核函数

Table 29-2. arp commands, corresponding ip commands, and kernel functions invoked

 

用户空间命令 (User-space command)              net-tools 调用的内核函数
net-tools        IPROUTE2                     Kernel function invoked by net-tools
arp -s ...       ip neigh add ...             arp_req_set
arp -d ...       ip neigh del ...             arp_req_delete
arp              ip neigh show ...            /proc/net/arp 文件 (/proc/net/arp file)

arp_req_xxx 例程在 net/ipv4/arp.c 中定义。在同一个文件中，您可以找到操作虚拟文件 /proc/net/arp 的例程。arp 通过读取此文件（而不是向内核发出 ioctl 调用）来获取信息，即使内核提供了一个名为 arp_req_get、可以执行该请求的例程。请参阅 net/ipv4/arp.c 中 arp_seq_ops 结构体的定义，以了解有关该 /proc 文件用法的更多信息。

The arp_req_ xxx routines are defined in net/ipv4/arp.c. In the same file, you can find the routines that manipulate the virtual /proc/net/arp file. arp reads this file instead of issuing ioctl calls to the kernel to obtain information, even though the kernel provides a routine named arp_req_get that can perform the request. See the definition of the arp_seq_ops structure in net/ipv4/arp.c to find out more about the use of the /proc file.

通过 /proc 文件系统进行调整

Tuning via /proc Filesystem

正如我们在前面的章节中看到的,相邻协议遵循常见的内核实践,即在/proc目录中提供方便的接口,以便管理员调整子系统的参数。邻近子系统 的参数驻留在四个目录中,两个用于 IPv4,两个用于 IPv6:

As we saw in an earlier chapter, the neighboring protocols follow the common kernel practice of offering a convenient interface in the /proc directory to let administrators tune the subsystem's parameters. The neighboring subsystem 's parameters reside in four directories, two for IPv4 and two for IPv6:

/proc/sys/net/ipv4/neigh
/proc/sys/net/ipv4/neigh

/proc/sys/net/ipv6/neigh
/proc/sys/net/ipv6/neigh

相邻子系统的通用参数,例如用于控制缓存操作何时发生的计时器

Generic parameters of the neighboring subsystem, such as the timers used to control when cache operations take place

/proc/sys/net/ipv4/conf
/proc/sys/net/ipv4/conf

/proc/sys/net/ipv6/conf
/proc/sys/net/ipv6/conf

协议中的特定行为,例如第 28 章“可调 ARP 选项” 部分中描述的行为

Particular behaviors within the protocol, such as the ones described in the section "Tunable ARP Options" in Chapter 28

每个目录都包含系统上每个 NIC 设备的子目录、默认子目录以及(对于conf目录) 可用于一次性将更改应用到所有设备的all子目录。在conf下 ,default子目录显示每个功能的全局状态,而在 neigh下,default子目录显示每个功能的默认设置(即配置参数)。默认子目录的值用于在创建每个设备的子目录时对其进行初始化。

Each directory contains a subdirectory for each NIC device on the system, a default subdirectory, and (in the case of the conf directory) an all subdirectory that can be used to apply a change to all the devices at once. Under conf, the default subdirectory shows the global status of each feature, while under neigh, the default subdirectory shows the default setting (i.e., configuration parameters) of each feature. The values of the default subdirectories are used to initialize the per-device subdirectories when the latter are created.

各个设备的目录优先于更通用的目录。但并非所有设备都会关注所有参数；如果某个参数与设备无关，则关联的目录中仍包含该参数的文件，但内核会忽略它。例如，gc_thresh1 值不被任何协议使用，而 locktime 只有 IPv4 使用。

The directories for individual devices take precedence over the more general directories. But not all devices pay attention to all the parameters; if a parameter is not relevant to a device, the associated directory contains a file for the parameter but the kernel ignores it. For instance, the gc_thresh1 value is not used by any protocol, and only IPv4 uses locktime.

图 29-3显示了文件的布局以及注册它们的例程。

Figure 29-3 shows the layout of the files and the routines that register them.

图 29-3 右上角的三个文件 arp、arp_cache 和 ndisc_cache 不用于配置任何内容，而仅用于导出只读数据。请注意，它们位于 /proc/net 目录中，而不是 /proc/sys 中。如"老一代工具：net-tools 的 arp 命令"一节所述，arp 命令使用 /proc/net/arp 来转储 ARP 缓存的内容（ND 没有对应项）。/proc/net/stat/xxx_cache 文件导出有关协议缓存的统计信息。它们的大多数文件表示 neigh_statistics 结构的字段，如"neigh_statistics 结构"一节中所述。

The three files arp, arp_cache, and ndisc_cache at the top-right corner of Figure 29-3 are not used to configure anything, but just to export read-only data. Note that they are in the /proc/net directory, not in /proc/sys. /proc/net/arp is used by the arp command to dump the contents of the ARP cache (there is no counterpart for ND), as discussed in the section "Old-Generation Tool: net-tools's arp Command." The /proc/net/stat/ xxx _cache files export statistics about the protocol caches. Most of their files represent fields of neigh_statistics structures, described in the section "neigh_statistics Structure."

/proc/sys/net/ipv4/neigh 目录

The /proc/sys/net/ipv4/neigh Directory

该目录包含来自 neigh_parms 结构体的参数，这些参数在第 27 章中介绍过。正如该章所解释的，每个设备对于与其交互的每个邻居协议都有一个 neigh_parms 结构（参见第 27 章中的图 27-2）。我们还看到，neigh_table 结构中嵌入了另一个 neigh_parms 实例来存储默认值。

This directory contains parameters from neigh_parms structures, which were introduced in Chapter 27. As that chapter explained, each device has one neigh_parms structure for each neighboring protocol that it interacts with (see Figure 27-2 in Chapter 27). We have also seen that another neigh_parms instance is included in the neigh_table structure to store default values.

但是，并非 neigh_parms 结构的所有字段都导出到 /proc。例如，reachable_time 是一个派生字段，其值由 base_reachable_time 间接计算得出，因此用户无法更改。另外，tbl 和 neigh_setup 是内核用来组织其数据结构的，与协议本身没有任何关系，因此它们不被导出。

However, not all fields of the neigh_parms structure are exported to /proc. For instance, reachable_time is a derived field whose value is indirectly calculated from base_reachable_time and therefore cannot be changed by the user. In addition, tbl and neigh_setup are used by the kernel to organize its data structures and do not have anything to do with the protocol itself, so they are not exported.

除了将 neigh_parms 结构中的大部分参数导出到 /proc 之外，邻居子系统也从 neigh_table 结构中导出一些参数。

In addition to exporting most of the parameters in the neigh_parms structure to /proc, the neighboring subsystem exports a few from the neigh_table structure, too.

全局目录和每设备目录的初始化

Initialization of global and per-device directories

因为默认值是由协议本身提供的，所以 default 子目录是在协议初始化时（请参阅 arp_init 和 ndisc_init 函数）安装的，并用文件填充，这些文件的名称基于 neigh_parms 结构中关联字段的名称。表 29-3 中各字段的默认值可以直接在 xxx_tbl 表的初始化代码中找到；第 28 章展示了 ARP 的示例。

Because the default values are provided by the protocol itself, the default subdirectory is installed when the protocol is initialized (see the arp_init and ndisc_init functions) and populated with files whose names are based on those of the associated fields in the neigh_parms structure. You can find the default values of the fields in Table 29-3 directly in the initializations of the xxx _tbl tables; Chapter 28 shows an example for ARP.

相邻子系统的 /proc/sys 文件注册示例

图 29-3。相邻子系统的 /proc/sys 文件注册示例

Figure 29-3. Example of /proc/sys file registration for the neighboring subsystem

内核变量与 /proc/sys/net/ipv4/neigh/xxx/ 中文件名之间的关系如表 29-3 所示。参见 net/core/neighbour.c 中 neigh_sysctl_template 的初始化；第 3 章提供了阅读该模板的指南。

The relationships between the kernel variables and the names of the files in /proc/sys/net/ipv4/neigh/ xxx / are shown in Table 29-3. See the initialization of neigh_sysctl_template in net/core/neighbour.c; a guide to reading the template is in Chapter 3.

表29-3。/proc/sys/net/ipv4/neigh 子目录中的内核变量和关联文件

Table 29-3. Kernel variables and associated files in /proc/sys/net/ipv4/neigh subdirectories

内核变量名                 文件名                        IPv4/IPv6 的默认值
Kernel variable name      Filename                     Default value for IPv4/IPv6
mcast_probes              mcast_solicit                3
ucast_probes              ucast_solicit                3
app_probes                app_solicit                  0
retrans_time              retrans_time                 100 * HZ
base_reachable_time       base_reachable_time          30 * HZ
delay_probe_time          delay_first_probe_time       5 * HZ
gc_staletime              gc_stale_time                60 * HZ
queue_len                 unres_qlen                   3
proxy_qlen                proxy_qlen                   64
anycast_delay             anycast_delay                1 * HZ
proxy_delay               proxy_delay                  (8 * HZ) / 10
locktime                  locktime                     1 * HZ
gc_interval               gc_interval                  30 * HZ
gc_thresh1                gc_thresh1                   128
gc_thresh2                gc_thresh2                   512
gc_thresh3                gc_thresh3                   1,024

每个设备的目录是在首次配置设备时创建的。第一次在设备D上配置地址时,会在/proc/sys/net/ipv4/neigh下创建一个名为D的目录。所有参数都适用于设备而不是特定地址,因此每个设备只有一个目录,即使它配置了多个地址。

Each device's directories are created when the device is first configured. The first time an address is configured on device D, a directory with the name D is created under /proc/sys/net/ipv4/neigh. All of the parameters apply to the device rather than to a specific address, so there is only a single directory for each device, even if it is configured with multiple addresses.

图 29-3显示了如果主机具有三个名为eth0eth1eth2的设备,您将看到的目录树;如果eth0eth1已被分配 IPv4 地址;如果 eth0也被赋予了 IPv6 地址;如果eth2尚未配置。

Figure 29-3 shows the directory tree you would see if a host had three devices named eth0, eth1, and eth2; if eth0 and eth1 had been given IPv4 addresses; if eth0 had also been given an IPv6 address; and if eth2 has not been configured yet.

负责配置 IPv4 和 IPv6 设备的两个函数分别是 inetdev_init 和 ip6_add_dev。两者都调用 neigh_sysctl_register 在 /proc 下创建设备的子目录，如下一节所述。

The two functions in charge of configuring IPv4 and IPv6 devices are inetdev_init and ip6_add_dev, respectively. Each calls neigh_sysctl_register to create the device's subdirectory under /proc, as described in the following section.

目录创建

Directory creation

/proc/sys/net/ipv4/neigh 中的 default 目录和每个设备的目录都是用 neigh_sysctl_register 函数创建的。该函数通过输入参数 dev 的值来区分这两种情况。以 IPv4 为例，您可以比较 arp_init（协议初始化函数）和 inetdev_init（设备配置块的初始化函数）调用 neigh_sysctl_register 的方式。neigh_sysctl_register 需要区分这两种情况，以便：

Both the default and the per-device directories in /proc/sys/net/ipv4/neigh are created with the neigh_sysctl_register function. The latter differentiates between the two cases by using the value of the input parameter dev. If we take IPv4 as an example, you can compare the way arp_init (a protocol initialization function) and inetdev_init (a device's configuration block initializer) call neigh_sysctl_register. neigh_sysctl_register needs to differentiate between the two cases to:

  • 选择要创建的目录的名称。当 dev 为 NULL 时使用 default，否则从设备本身（dev->name）提取。

  • Pick the name of the directory to create. It will be default when dev is NULL, and extracted from the device itself (dev->name) otherwise.

  • 决定将哪些参数作为文件添加到该目录；default 目录将比其他目录多包含几个参数（确切地说是四个）。虽然从 neigh_parms 中提取的参数在按设备配置时有意义，但 neigh_table 中的参数则不然。因此，从 neigh_table 中获取的四个参数仅出现在 default 目录中（参见表 29-3 末尾）。这四个参数与垃圾收集过程相关：

    • gc_interval

    • gc_thresh1, gc_thresh2, gc_thresh3

  • Decide what parameters to add as files to that directory; the default directory will include a few more parameters than the others (four to be exact). While the parameters extracted from neigh_parms are meaningful when configured on a per-device basis, the ones in neigh_table are not. Thus, the four parameters taken from neigh_table go only in the default directory (see the end of Table 29-3). Those four parameters are related to the garbage collection process:

    • gc_interval

    • gc_thresh1, gc_thresh2, gc_thresh3

neigh_sysctl_register 的输入参数的含义如下：

Here is the meaning of the input parameters to neigh_sysctl_register:

struct net_device *dev
struct net_device *dev

与正在创建的目录关联的设备。当 dev 为 NULL 时，表示调用该函数是为了创建 default 目录。

Device associated with the directory being created. When dev is NULL, it means the function has been invoked to create the default directory.

struct neigh_parms *p
struct neigh_parms *p

将导出其参数的结构。例如，使用 ARP 的设备传入 in_dev->arp_parms。当 dev 为 NULL 时，这是嵌入在协议的 neigh_table 结构中的 neigh_parms 实例（neigh_table->neigh_parms），它存储协议的默认值。

Structure whose parameters will be exported. A device using ARP, for instance, passes in_dev->arp_parms. When dev is NULL, this is the neigh_parms instance embedded in the protocol's neigh_table structure (neigh_table->neigh_parms), which stores the protocol's defaults.

int p_id
int p_id

协议标识符。请参阅 include/linux/sysctl.h 中的 NET_XXX 值。例如，ARP 使用 NET_IPV4。

Protocol identifier. See the NET_ XXX values in include/linux/sysctl.h. ARP, for instance, uses NET_IPV4.

int pdev_id
int pdev_id

正在导出的参数的类标识符。请参阅 include/linux/sysctl.h 中的 NET_IPV4_XXX 值。例如，ARP 使用 NET_IPV4_NEIGH。

Class identifier of parameters being exported. See the NET_IPV4_ XXX values in include/linux/sysctl.h. ARP, for example, uses NET_IPV4_NEIGH.

char *p_name
char *p_name

指示引用相邻协议字段的 L3 协议的字符串。例如,ARP 使用“ipv4”。

String indicating the L3 protocol that refers to the neighboring protocol fields. ARP, for example, uses "ipv4".

proc_handler *handler
proc_handler *handler

当用户修改某个导出字段的值时内核调用的函数。只有 IPv6 传递非 NULL 值，并且它提供的函数只是内核本来会安装的默认处理程序的包装器。有关示例，请参阅 net/ipv6/ndisc.c 中的 ndisc_ifinfo_sysctl_change。

Function that the kernel invokes when the value of one of the exported fields is modified by the user. Only IPv6 passes a non-NULL value, and the function it provides is simply a wrapper to the default handler that the kernel would install otherwise. See ndisc_ifinfo_sysctl_change in net/ipv6/ndisc.c for an example.

该函数中唯一棘手的部分是如何从 neigh_table 结构中提取四个 gc_xxx 参数。它依赖于内存布局上的一个技巧：这四个与垃圾收集相关的参数在 neigh_table 结构体中紧跟在 neigh_parms 结构之后存储，如下所示：

The only tricky part in the function is how the four gc_ xxx parameters are extracted from the neigh_table structure. It relies on a trick of memory layout: the four parameters related to garbage collection are stored in the neigh_table structure right after the neigh_parms structure, as shown here:

struct neigh_table
        ...
        struct neigh_parms parms;
        int gc_interval;
        int gc_thresh1;
        int gc_thresh2;
        int gc_thresh3;
        ...

因此，要检索 neigh_table 中的这些值，该函数所需要做的就是越过 neigh_parms，将指针转换为整型指针，并连续提取四个整数：

Thus, all the function needs to do to retrieve the neigh_table values is to go past neigh_parms, cast the pointer to an integer, and extract four integers in a row:

    if (dev) {
        dev_name_source = dev->name;
        t->neigh_dev[0].ctl_name = dev->ifindex;
        memset(&t->neigh_vars[12], 0, sizeof(ctl_table));
    } else {
        t->neigh_vars[12].data = (int *)(p + 1);
        t->neigh_vars[13].data = (int *)(p + 1) + 1;
        t->neigh_vars[14].data = (int *)(p + 1) + 2;
        t->neigh_vars[15].data = (int *)(p + 1) + 3;
    }

/proc/sys/net/ipv4/conf 目录

The /proc/sys/net/ipv4/conf Directory

/proc/sys/net/ipv4/conf 子目录中的文件与 ipv4_devconf 结构体的字段相关联，该结构体在 include/linux/inetdevice.h 中定义。并非它的所有字段都被邻居协议使用（有关其他字段，请参阅第 23 章和第 36 章）。表 29-4 列出了与邻居协议相关的参数；它们的含义在第 28 章"可调 ARP 选项"一节中进行了描述。

The files in the /proc/sys/net/ipv4/conf subdirectories are associated with the fields of the ipv4_devconf structure, which is defined in include/linux/inetdevice.h. Not all of its fields are used by the neighboring protocols (see Chapters 23 and 36 for the other fields). Table 29-4 lists the parameters relevant to the neighboring protocols; their meanings were described in the section "Tunable ARP Options" in Chapter 28.

Table 29-4. Kernel variables and associated files in /proc/sys/net/ipv4/conf subdirectories

Kernel variable name          Filename        Default value for IPv4/IPv6
ipv4_devconf.arp_announce     arp_announce    0
ipv4_devconf.arp_filter       arp_filter      0
ipv4_devconf.arp_ignore       arp_ignore      0
ipv4_devconf.medium_id        medium_id       0
ipv4_devconf.proxy_arp        proxy_arp       0

As shown in Figure 29-3, in addition to the per-device subdirectories, there are also two special ones named default and all. See Chapter 36 for more details.

Data Structures Featured in This Part of the Book

In the section "Main Data Structures" in Chapter 27, we had a brief overview of the main data structures used by the neighboring subsystem. This section presents a detailed description of the fields of each data structure.

Figure 29-4 shows the files that define each data structure. The ones with a lighter color are not part of the neighboring subsystem, but I referred to them in this part of the book.

Figure 29-4. Distribution of data structures in kernel files

neighbour Structure

Neighbors are represented by struct neighbour structures. The structure is complex and includes status fields, virtual functions to interface with L3 protocols, timers, and cached L2 headers.

Here is a field-by-field description:

struct neighbour *next

Each neighbour entry is inserted in a hash table. next links the structure to the other ones that collide and share the same bucket. Elements are always inserted at the head of the list (see the section "Creating a neighbour Entry," and Figure 27-2 in Chapter 27).

struct neigh_table *tbl

Pointer to the neigh_table structure that defines the protocol associated with this entry. If the neighbor is an IPv4 address, for instance, tbl points to arp_tbl.

struct neigh_parms *parms

Parameters used to tune the neighboring protocol behavior. When a neighbour structure is created, parms is initialized with the values of the default neigh_parms structure embedded in the protocol's associated neigh_table structure. When the protocol's constructor method is called by neigh_create (e.g., arp_constructor for ARP), that block is replaced with the configuration block of the associated device, if any. While most devices use the system defaults, a device can start up with different parameters or be configured by the administrator later to use different parameters, as discussed earlier in this chapter.

struct net_device *dev

The device through which the neighbor is reachable. Only one device can be used to reach each neighbor. Thus, the value NULL never appears here as it does in other kernel subsystems that use it as a wildcard to refer to all devices.

unsigned long confirmed

Timestamp (in jiffies) of when the reachability of the entry was most recently confirmed. L4 protocols can update it with neigh_confirm (see Figure 26-14 in Chapter 26). The neighboring infrastructure updates it in neigh_update.

unsigned long updated

Timestamp of the most recent time the entry was updated by neigh_update (the only exception is the first initialization by neigh_alloc). Do not confuse updated and confirmed, which keep track of very different things. The updated field is set when the state of a neighbor changes, whereas the confirmed field merely records one particular change of state: the one that occurs when the entry was most recently confirmed to be valid.

unsigned long used

Most recent time the entry was used. Its value is not always updated synchronously with the data transmissions. When the entry is not in the NUD_CONNECTED state, this field is updated by neigh_event_send, which is called by neigh_resolve_output. In contrast, when the entry is in the NUD_CONNECTED state, its value is sometimes updated by neigh_periodic_timer to the time the entry's reachability was most recently confirmed.

__u8 flags

Possible values for this field are listed in include/linux/rtnetlink.h and include/net/neighbour.h:

#define NTF_PROXY 0x08

When the ip neigh user-space command is used to add entries to the proxy tables (for instance, ip neigh add proxy 10.0.0.2 dev eth0), this flag is set in the data structure sent to the kernel, to let the kernel handler neigh_add know that the new entry has to be added to the proxy table (see the section "System Administration of Neighbors").

#define NTF_ROUTER 0x80

This flag is used only by IPv6. When set, it means the neighbor is a router. Unlike NTF_PROXY, this flag is not set by user-space tools. The IPv6 neighbor discovery code updates its value when receiving information from the neighbor.

__u8 nud_state

Indicates the entry's state. The possible values are defined in include/net/neighbour.h and include/linux/rtnetlink.h with names of the form NUD_XXX. The role of states is described in the section "Transitions Between NUD States" in Chapter 26. Figure 26-13 in Chapter 26 shows how the state changes depending on various events.

__u8 type

This parameter is set when the entry is created with neigh_create by calling the protocol constructor method (e.g., arp_constructor for ARP). Its value is used in various circumstances, such as to decide what value to give nud_state. type can assume the values in Table 36-12 in Chapter 36, listed in include/linux/rtnetlink.h.

In the context of this chapter, not all of the values of that table are actually used: we are mostly interested in RTN_UNICAST, RTN_LOCAL, RTN_BROADCAST, RTN_ANYCAST, and RTN_MULTICAST.

Given an IPv4 address (such as the L3 address associated with a neighbour entry), the inet_addr_type function finds the associated RTN_XXX value (see Chapter 28). For IPv6, there is a similar function called ipv6_addr_type.

__u8 dead

When dead is set to 1 it means the structure is being removed and cannot be used anymore. See neigh_ifdown in the section "External Events" in Chapter 32, and neigh_forced_gc and neigh_periodic_timer for examples of usage.

atomic_t probes

Number of failed solicitation attempts. Its value is checked by the neigh_timer_handler timer, which puts the neighbour entry into the NUD_FAILED state when the number of attempts reaches the maximum allowed value.

rwlock_t lock

Used to protect the neighbour structure from race conditions.

unsigned char ha[]

The L2 address (e.g., Ethernet MAC address for Ethernet NICs) associated with the L3 address represented by primary_key (discussed shortly). The address is in binary format. The size of the vector ha is MAX_ADDR_LEN (defined as 32 in include/linux/netdevice.h), rounded up to the first multiple of a C long. An Ethernet address requires only six octets (i.e., 48 bits), but other link layer protocols may require more. For each hardware address type, the kernel defines a symbol that is assigned the size of the address. Most symbols use names like XXX _ALEN or XXX _ADDR_LEN. Ethernet, for example, defines the ETH_ALEN symbol in include/linux/if_ether.h.

struct hh_cache *hh

List of cached L2 headers. See the section "L2 Header Caching" in Chapter 27.

atomic_t refcnt

Reference count. See the sections "Caching" and "Reference Counts on neighbour Structures" in Chapter 27.

int (*output)(struct sk_buff *skb)

Function used to transmit frames to the neighbor. The actual routine this function pointer points to can change several times during the structure's lifetime, depending on several factors. It is first initialized by the neigh_table's constructor method (see the section "Initialization of a neighbour Structure" in Chapter 28). It can be updated by calling neigh_connect or neigh_suspect when the neighbor state goes to NUD_REACHABLE or NUD_STALE state, respectively.

struct sk_buff_head arp_queue

Packets whose destination L3 address has not been resolved yet are temporarily placed into this queue. Despite the name of this field, it can be used by all neighboring protocols, not just ARP. See the section "Egress Queuing" in Chapter 27.

struct timer_list timer

Timer used to handle several tasks. See the section "Timers" in Chapter 15.

struct neigh_ops *ops

VFT containing the methods used to manipulate the neighbour entry. Among the methods, for instance, are several used to transmit packets, each optimized for a different state or associated device type. Each protocol provides three or four different VFTs; which is used for a specific neighbour entry depends on the type of L3 address, the type of associated device, and the type of link (e.g., point-to-point). See the upcoming section "neigh_ops Structure," and the section "Initialization of neigh->ops" in Chapter 27.

u8 primary_key[0];

L3 address of the neighbor. It is used as the key by the cache lookup functions. It is an IPv4 address for ARP entries and an IPv6 address for neighbor discovery entries.

neigh_table Structure

This structure is used to tune the behavior of a neighboring protocol. There are a few instances of neigh_table in the kernel, each for a different protocol:

arp_tbl

ARP protocol used by IPv4 (see net/ipv4/arp.c)

nd_tbl

Neighbor discovery protocol used by IPv6 (see net/ipv6/ndisc.c)

dn_neigh_table

Neighbor discovery protocol used by DECnet (see net/decnet/dn_neigh.c)

clip_tbl

ATM over IP protocol (see net/atm/clip.c)

These neigh_table structures are initialized when the associated subsystems are initialized in the kernel, and are inserted into a global list pointed to by neigh_tables, as shown in Figure 27-2 in Chapter 27.

The data structures contain most (if not all) of the information required by the neighboring protocol. Therefore, each neighbour entry has a neigh->tbl pointer to its associated neigh_table; for instance, a neighbour entry associated with an IPv4 address will have a pointer to the arp_tbl structure, whereas an IPv6 entry will have a pointer to nd_tbl.

To understand the field-by-field descriptions more easily, refer to the initializations of the four tables as examples—in particular, arp_tbl, which is also discussed in the section "The arp_tbl Table" in Chapter 28.

struct neigh_table *next

Links all the protocol tables in a list.

rwlock_t lock

Lock used to protect the table from possible race conditions. It is used in read-only mode by functions such as neigh_lookup that only need read permission, and in read/write mode by other functions such as neigh_periodic_timer.

Note that the whole table is protected by a single lock, as opposed to something more granular such as a different lock for each bucket of the table's cache.

char *id

This is just a string that identifies the protocol. It is used mainly as an ID when allocating the memory pool used to allocate neighbour structures (see neigh_table_init).

struct proc_dir_entry *pde

File registered in /proc/net/stat/ to export statistics about the protocol. For instance, ARP creates /proc/net/stat/arp_cache. The file is created by neigh_table_init when the protocol is initialized.

int family

Address family of the entries represented by the neighboring protocol. Its possible values are listed in the file include/linux/socket.h, with names in the form AF_XXX. For IPv4 and IPv6, the associated values are AF_INET and AF_INET6, respectively.

int entry_size

Size of the structures inserted into the cache. Since a neighbour structure includes a field whose size depends on the protocol (primary_key), entry_size is set to the sum of the size of a neighbour structure and the size of the primary_key provided by the protocol. In the case of IPv4/ARP, for instance, this field is initialized to sizeof(struct neighbour) + 4, where 4 is, of course, the size in bytes of an IPv4 address. The field is used, for instance, by neigh_alloc when clearing the content of the entries retrieved from the cache.[*]

int key_len

Length of the key used by the lookup functions (see the section "Caching" in Chapter 27). Because the key is the L3 address, this is 4 for IPv4, 16 for IPv6, and 2 for DECnet.

__u32 (*hash)(const void *pkey, const struct net_device *)

Hash function applied to the search key (e.g., L3 address) to select the right bucket of the hash table when doing a lookup.

int (*constructor)(struct neighbour *)

The constructor method is invoked by neigh_create when creating a new entry, and initializes the protocol-specific fields of a new neighbour entry. For example, the one used by ARP (arp_constructor) is described in detail in the section "Initialization of a neighbour Structure" in Chapter 28.

struct neigh_parms parms

This data structure contains some parameters used to tune the behavior of the protocol, such as how much time to wait before resending a solicitation request after not receiving a reply, and how many packets to keep in a queue waiting for the reply before transmitting them. See the section "neigh_parms Structure."

struct neigh_parms *parms_list

Not used.

kmem_cache_t *kmem_cachep

Memory pool used when allocating neighbour structures. It is allocated and initialized at protocol initialization time by neigh_table_init. You can check its status by dumping the contents of the /proc/slabinfo file.

atomic_t entries

Number of neighbour instances currently in the protocol's cache. Its value is incremented when allocating a new entry with neigh_alloc and decremented when deallocating an entry with neigh_destroy. See the description of gc_thresh1, gc_thresh2, and gc_thresh3 later in this section.

unsigned long last_rand

Time (expressed in jiffies) when the variable reachable_time of the neigh_parms structures associated with the table (there is one for each device) was most recently updated.

struct neigh_statistics *stats

Various statistics about the neighbour instances in the cache. See the section "neigh_statistics Structure."

struct neighbour **hash_buckets

Hash table that stores the neighbour entries.

unsigned int hash_mask

Size of the hash table. See Figure 27-6 in Chapter 27.

__u32 hash_rnd

Random value used to distribute neighbour entries in the cache when its size is increased. See the section "Caching" in Chapter 27.

The following variables and functions are used by the garbage collection algorithm described in the section "Garbage Collection" in Chapter 27:

int gc_interval

This controls how often the gc_timer timer expires, kicking off garbage collection. It used to be 30 seconds but now it is shorter. The timer causes garbage collection on only one bucket of the hash table each time. See the section "Garbage Collection" in Chapter 27 for more information.

int gc_thresh1

int gc_thresh2

int gc_thresh3

These three thresholds define different levels of memory usage granted to the neighbour entries currently cached by the neighboring protocol.

unsigned long last_flush

This variable, measured in jiffies, represents the most recent time neigh_forced_gc was executed. In other words, it represents the most recent time a garbage collection process was forced because of low memory conditions.

struct timer_list gc_timer

Garbage collector timer. See the section "Garbage Collection" in Chapter 27.

unsigned int hash_chain_gc

Keeps track of the next bucket of the hash table the periodic garbage collector timer should scan. The buckets are scanned sequentially.

The following fields are used when the system acts as a proxy. See the section "Acting As a Proxy" in Chapter 27.

struct pneigh_entry **phash_buckets

Table that stores the L3 addresses that must be proxied.

int (*pconstructor)(struct pneigh_entry *)

void (*pdestructor)(struct pneigh_entry *)

pconstructor is the counterpart of constructor. Right now, only IPv6 uses pconstructor; it registers a specific multicast address when the associated device is first configured.

pdestructor is called when releasing a proxy entry. It is used only by IPv6 and undoes the work of the pconstructor method.

struct sk_buff_head proxy_queue

Received solicit requests (e.g., received ARPOP_REQUEST packets in the case of ARP) are queued into this queue when proxying is enabled and configured with a non-null proxy_delay delay. New elements are queued at the tail.

void (*proxy_redo)(struct sk_buff *skb)

Function that processes the solicit requests (e.g., ARPOP_REQUEST packets for ARP) after they are extracted from the proxy queue neigh_table->proxy_queue. See the section "Delayed Processing of Solicitation Requests" in Chapter 27.

struct timer_list proxy_timer

This timer is started when there is at least one element in proxy_queue. The handler that is executed when the timer expires is neigh_proxy_process. The timer is initialized at protocol initialization by neigh_table_init. Unlike the timer neigh_table->gc_timer, this one is not periodic and is started only if needed (for instance, a protocol might start it when the first element is added to proxy_queue). The section "Acting As a Proxy" in Chapter 27 describes why and when elements are queued to proxy_queue and how proxy_timer processes them.

neigh_parms Structure

The neigh_parms data structure stores the configurable parameters of the neighboring protocol. For each configured L3 protocol that uses a neighbor protocol, there is one instance of neigh_parms for each device[*] plus one that stores the default values.

Here is the field-by-field description:

struct neigh_parms *next

Pointer that links neigh_parms instances associated with the same protocol family. This means that each neigh_table has its own list of neigh_parms structures, one instance for each configured device (see Figure 27-2 in Chapter 27).

int (*neigh_setup)(struct neighbour *)

Initialization function used mainly by those devices that are still using the old neighboring infrastructure. This function is normally used just to initialize neighbour->ops to the arp_broken_ops instance (see the section "neigh_ops Structure" later in this chapter, and the section "Initialization of neigh->ops" in Chapter 27). Look at shaper_neigh_setup in drivers/net/shaper.c for an example. To see when this initialization function is called during the initialization phase of a new neighbour instance, see Figure 28-11 in Chapter 28.

Do not confuse this virtual function with net_device->neigh_setup. The latter is called when the first L3 address is configured on a device, and normally initializes neigh_parms->neigh_setup, too. net_device->neigh_setup is called only once for each device, and neigh_parms->neigh_setup is called once for each neighbour structure that will be associated with the device.

struct neigh_table *tbl

Back pointer to the neigh_table structure that holds this structure.

int entries

void *priv

Not used.

void *sysctl_table

This table, initialized at the end of the file net/ipv4/neighbour.c, is involved in allowing users to modify the values of those parameters of the neigh_parms data structure that are exported via /proc, as described in the section "Tuning via /proc Filesystem."

int base_reachable_time

int reachable_time

base_reachable_time is the interval of time (expressed in jiffies) since the most recent proof of reachability was received. Note that this interval is used as a base value to compute the real one, which is stored in reachable_time [*] and is given a random (and uniformly distributed) value ranging between base_reachable_time and 3/2 base_reachable_time. This random value is updated every 300 seconds by neigh_periodic_timer, but it can also be updated by other events (especially for IPv6).

int retrans_time

When a host does not receive a reply to a solicitation request within retrans_time, a new one is sent, up to a given number of maximum attempts. retrans_time is expressed in jiffies.

int gc_staletime

A neighbour structure is removed if it has not been used for gc_staletime time and no one holds a reference to it. gc_staletime is expressed in jiffies.

int delay_probe_time

This indicates how long a neighbor in the NUD_DELAY state waits before entering the NUD_PROBE state. See Figure 26-13 in Chapter 26.

int queue_len

Maximum number of elements that can be queued in the arp_queue queue.

int proxy_qlen

Maximum number of elements that can be queued in the proxy_queue queue.

int ucast_probes

int app_probes

int mcast_probes

ucast_probes is the number of unicast solicitations that can be sent to confirm the reachability of an address.

app_probes is the number of solicitations that can be sent by a user-space application when resolving an address (see the section "ARPD" in Chapter 28 for the IPv4/ARP case).

mcast_probes is the number of multicast solicitations that can be sent to resolve a neighbor's address. For ARP/IPv4, this is actually the number of broadcast solicitations, because ARP does not use multicast solicitations. IPv6 does.

Note that mcast_probes and app_probes are mutually exclusive (only one can be non-null).

int anycast_delay

Not used.

int proxy_delay

Amount of time (expressed in jiffies) that neighboring protocol packets handled by a proxy should be kept in a queue before being processed. See the section "Delayed Processing of Solicitation Requests" in Chapter 27.

int locktime

Minimum time, expressed in jiffies, that has to pass between two updates of the fields of a neighbour entry (typically nud_state and ha). This window helps avoid some nasty ping-pong effects that can take place, for instance, when more than one proxy ARP server is present on the same network segment and all of them reply to the same query solicitations with conflicting addresses. Details of this behavior are discussed in the section "Final Common Processing" in Chapter 28.

int dead

Boolean flag that is set to mark the neighbor instance as "Being removed." See neigh_parms_release.

atomic_t refcnt

Reference count.

struct rcu_head rcu_head

Used to take care of mutual exclusion.

The use of the reference count refcnt deserves a few more words. Please refer to Figure 27-2 in Chapter 27 during this discussion. Because there is an instance of neigh_parms per device per protocol, and one instance embedded in the neigh_table structure to hold the default values, plus a pointer in each neighbour structure, it may be confusing to understand who points to whom and who is who. Let's try to clarify these points.

每个neigh_table,因此每个协议,都有自己的 实例neigh_parms。该实例保存协议提供的默认值。每个设备都net_device可以配置多个 L3 协议。对于配置的每个 L3 协议,net_device都有一个指向存储配置的协议特定结构的指针(例如,in_device对于 IPv4)。该结构包括指向该实例的指针,neigh_parms该实例用于存储 L3 协议(例如,用于 IPv4 的 ARP)所使用的相邻协议的设备特定配置。

Each neigh_table, and therefore each protocol, has its own instance of neigh_parms. That instance holds the default values that the protocol provides. Each device's net_device can be configured with more than one L3 protocol. For each L3 protocol configured, net_device has a pointer to a protocol-specific structure that stores the configuration (e.g., in_device for IPv4). That structure includes a pointer to an instance of neigh_parms that is used to store the device-specific configuration of the neighboring protocol used by the L3 protocol (e.g., ARP for IPv4).

表 29-5列出了分配结构的主要协议初始化例程neigh_parms对于两种IP协议,结果如图29-3所示 。

Table 29-5 lists the main protocol initialization routines, which allocate neigh_parms structures. For the two IP protocols, you can see the result in Figure 29-3.

Table 29-5. L3 protocol initialization functions

Protocol   Function        File
IPv4       inetdev_init    net/ipv4/devinet.c
IPv6       ipv6_add_dev    net/ipv6/addrconf.c
DECnet     dn_dev_create   net/decnet/dn_dev.c

Let's stick to IPv4 for the rest of the description. The neigh_parms instance used by ARP is allocated by inetdev_init, the IPv4 routine called when an IPv4 configuration is first applied to a device. The initial content of the new neigh_parms instance is copied from neigh_table->parms, where neigh_table is arp_tbl for ARP. Whenever a neighbour instance is created, neigh->parms is initialized to point to the neigh_parms instance of the associated device. As we saw in the section "Tuning via /proc Filesystem," both the global defaults (neigh_table->parms) and the per-device configuration can be changed by the administrator.

Because each per-device neigh_parms structure is referenced by all the neighbour instances associated with the device, neigh_parms->refcnt is used to keep track of them. The routines that directly or indirectly update the reference count are:

neigh_parms_alloc

neigh_parms_destroy

Allocate and destroy an instance of neigh_parms. neigh_parms_destroy is called only when the structure can be freed because the reference count is 0.

__neigh_parms_put

neigh_parms_put

__neigh_parms_put only decrements the reference count, and neigh_parms_put also invokes neigh_parms_destroy if the reference count becomes 0.

neigh_parms_release

Marks the instance as dead and indirectly invokes neigh_parms_put.

neigh_parms_clone

Increases the reference count on a structure and returns a pointer to it.

neigh_rcu_free_parms

Called by neigh_parms_release to actually delete the structure (here is where neigh_parms->rcu_head is used).

neigh_ops Structure

The neigh_ops structure consists of pointers to functions invoked at various times during the lifetime of a neighbour entry. Most of them are virtual functions that act as the interface between the L3 protocol and the dev_queue_xmit API introduced in Chapter 11. Some of them are provided by the overarching neighboring infrastructure (neigh_xxx functions), and others are provided by individual neighboring protocols (e.g., arp_xxx for ARP). See the section "Initialization of a neighbour Structure" in Chapter 28.

The main difference between the functions lies in the context where they are used. The section "Special Cases" in Chapter 26 covered the two most common cases.

Here is the field-by-field description:

int family

We already saw this field when describing the analogous family field of the neigh_table structure.

void (*destructor)(struct neighbour *)

Function executed when a neighbour entry is removed by neigh_destroy. It basically is the complementary method of neigh_table->constructor. But for some reason, constructor is in the neigh_table structure and destructor is in the neigh_ops structure.

void (*solicit)(struct neighbour *, struct sk_buff *)

Function used to send solicitation requests.

void (*error_report)(struct neighbour *, struct sk_buff *)

Function invoked when a neighbor is classified as unreachable. See the section "Events Generated by the Neighboring Layer" in Chapter 27.

The following four methods are used to transmit data packets, not neighboring protocol packets. The difference between them lies in the context where they are used. See the section "Common Interface Between L3 Protocols and Neighboring Protocols" in Chapter 27.

int (*output)(struct sk_buff *)

This is the most generic function and can be used in all contexts. It checks whether the address has already been resolved and starts the resolution if it has not. If the address is not ready yet, it stores the packet in a temporary queue and starts the resolution. Because it does everything necessary to ensure the recipient is reachable, it is a relatively expensive operation. Do not confuse neigh_ops->output with neighbour->output.

int (*connected_output)(struct sk_buff *)

Used when the neighbor is known to be reachable (i.e., the state is NUD_CONNECTED). It simply fills in the L2 header, because all the required information is available, and therefore is faster than output.

int (*hh_output)(struct sk_buff *)

Used when the address is resolved and a copy of the whole header has already been cached from a previous transmission. See the section "Interaction Between Neighboring Protocols and L3 Transmission Functions" in Chapter 27.

int (*queue_xmit)(struct sk_buff *)

The previous functions, with the exception of hh_output, do not actually transmit the packets. All they do is make sure the header is compiled and call the queue_xmit method when the buffer is ready for transmission. See Figure 27-3(b) in Chapter 27.
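
The division of labor among these methods can be sketched as a table of function pointers, in the spirit of neigh_ops: the caller picks the cheap path when the neighbor is known reachable and falls back to the generic one otherwise. This is a user-space illustration with simplified, hypothetical types, not the kernel's actual signatures.

```c
#include <stddef.h>

/* Simplified stand-ins; the real kernel types carry much more state. */
struct sk_buff { int sent_fast; };

struct neigh_ops {
    int (*output)(struct sk_buff *);            /* slow, always safe  */
    int (*connected_output)(struct sk_buff *);  /* neighbor reachable */
};

/* Slow path: would resolve the L2 address first, then transmit. */
static int generic_output(struct sk_buff *skb)
{
    skb->sent_fast = 0;
    return 0;
}

/* Fast path: address known, just fill in the L2 header and go. */
static int connected_output(struct sk_buff *skb)
{
    skb->sent_fast = 1;
    return 0;
}

static const struct neigh_ops demo_ops = {
    .output = generic_output,
    .connected_output = connected_output,
};

/* The caller dispatches through the VFT depending on neighbor state. */
int xmit(const struct neigh_ops *ops, struct sk_buff *skb, int nud_connected)
{
    if (nud_connected)
        return ops->connected_output(skb);
    return ops->output(skb);
}
```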

hh_cache Structure

The data structure used to store a cached L2 header is struct hh_cache, defined in include/linux/netdevice.h. (The name comes from "hardware header.") The following is a description of its fields; the section "L2 Header Caching" in Chapter 27 describes how it is used.

unsigned short hh_type

Protocol associated with the L3 address (see the ETH_P_XXX values in the file include/linux/if_ether.h).

struct hh_cache *hh_next

More than one cached L2 header can be associated with the same neighbour entry. However, there can be only one entry for any given value of hh_type (see neigh_hh_init).

atomic_t hh_refcnt

Reference count.

int hh_len

Length of the cached header expressed in bytes.

int (*hh_output)(struct sk_buff *skb)

Function used to transmit the packet. As with neigh->output, this method is initialized to one of the methods of the neigh->ops VFT.

rwlock_t hh_lock

Lock used to protect the hh_cache structure from possible race conditions. For instance, an IP function that wants to transmit a packet (see the section "Interaction Between Neighboring Protocols and L3 Transmission Functions" in Chapter 27) acquires the read lock before copying the header from the hh_cache structure to the skb buffer. The lock is held in exclusive mode when a field of the structure needs to be updated: for instance, the lock is acquired when hh_output needs to be initialized to a different function[*] or when the hh_cache->hh_data header needs to be updated because the destination link layer address has changed.

unsigned long hh_data[HH_DATA_ALIGN(LL_MAX_HEADER) / sizeof(long)]

Cached header.

neigh_statistics Structure

This structure stores statistics about the neighboring protocols, available for users to peruse. Each protocol keeps its own instance of the structure. The structure is defined in include/net/neighbour.h. The following is a description of its fields:

unsigned long allocs

Total number of neighbour structures allocated by the protocol. Includes ones that have already been removed.

unsigned long destroys

Number of removed neighbour entries. Updated in neigh_destroy.

unsigned long hash_grows

Number of times that the hash table has been increased in size. Updated in neigh_hash_grow (see the section "Caching" in Chapter 27).

unsigned long res_failed

Number of times an attempt to resolve a neighbor address failed. This value is not incremented every time a new solicitation is sent; it is incremented by neigh_timer_handler only when all the attempts have failed.

unsigned long lookups

Number of times the neigh_lookup routine has been invoked.

unsigned long hits

Number of times neigh_lookup returned success.

unsigned long rcv_probes_mcast

unsigned long rcv_probes_ucast

These two fields are used only by IPv6 and represent the number of solicitation requests (probes) received that were sent to multicast and unicast addresses, respectively.

unsigned long periodic_gc_runs

unsigned long forced_gc_runs

The number of times neigh_periodic_timer and neigh_forced_gc have been invoked, respectively. See the section "Garbage Collection" in Chapter 27.

The kernel keeps an instance of these counters for each CPU. The counters are updated with the NEIGH_CACHE_STAT_INC macro, defined in include/net/neighbour.h. Note that the macro updates the counter on the current CPU.
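
The idea behind per-CPU counters can be sketched in user-space C: one instance of the counters per CPU means an increment never contends with another CPU's increment, and readers sum the instances for a global view. NR_CPUS, the explicit cpu argument, and STAT_INC are simplified stand-ins (the kernel uses true per-CPU data and the current processor id, not an explicit argument).

```c
#define NR_CPUS 4

/* Simplified per-CPU statistics: one instance of the counters per CPU. */
struct neigh_stats {
    unsigned long allocs;
    unsigned long lookups;
    unsigned long hits;
};

static struct neigh_stats stats[NR_CPUS];

/* Mimics NEIGH_CACHE_STAT_INC: bump a counter of the given CPU. */
#define STAT_INC(cpu, field) (stats[(cpu)].field++)

/* Readers (e.g., the /proc export) sum the per-CPU instances. */
unsigned long stat_total_lookups(void)
{
    unsigned long sum = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += stats[cpu].lookups;
    return sum;
}
```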

The fields of the neigh_statistics structure are exported in the per-protocol /proc/net/stat/{protocol_name}_cache files.

Data Structures Featured in This Part of the Book

Table 29-6 summarizes the main functions, variables, and data structures introduced or referenced in the chapters of this book covering the neighboring subsystem.

Table 29-6. Functions, variables, and data structures in the neighboring subsystem

Functions

dev_queue_xmit
neigh_compat_output
neigh_resolve_output
neigh_connected_output
neigh_blackhole

Main routines used for packet transmission. See the section "Routines used for neigh->output" in Chapter 27.

neigh_update
neigh_update_hhs
neigh_sync

Update the information stored in a neighbour structure. See the section "Updating a Neighbor's Information: neigh_update" in Chapter 27.

neigh_confirm

Confirms the reachability of a neighbor.

neigh_create
neigh_destroy

Create and delete a neighbour structure as a consequence of protocol events. See the sections "Creating a neighbour Entry" and "Neighbor Deletion" in Chapter 27.

neigh_add
neigh_delete

Create and delete a neighbour structure as a consequence of a user-space command. See the section "System Administration of Neighbors."

neigh_alloc

Allocates a neighbour structure.

neigh_connect
neigh_suspect

Used to implement reachability. See the section "Initialization of neigh->output and neigh->nud_state" in Chapter 27.

neigh_table_init

Registers a neighboring protocol.

neigh_ifdown

Handles changes of state in the L3 address when notified by external subsystems. See the section "Updates via neigh_ifdown" in Chapter 27.

neigh_proxy_process

Function handler executed when the proxy timer expires. See the section "Delayed Processing of Solicitation Requests" in Chapter 27.

neigh_timer_handler

See the section "Timers" in Chapter 15.

neigh_periodic_timer
neigh_forced_gc

Used by the garbage collection algorithm. See the section "Garbage Collection" in Chapter 27.

neigh_lookup
__neigh_lookup
__neigh_lookup_errno
arp_find

Check for an entry in the cache. See the section "Caching" in Chapter 27.

neigh_hold
neigh_release

Increment/decrement the reference count on a neighbour structure.

pneigh_enqueue
pneigh_lookup

Used for destination-based proxying. See the sections "Delayed Processing of Solicitation Requests" and "Per-Device Proxying and Per-Destination Proxying" in Chapter 27, and the section "Proxy ARP" in Chapter 28.

arp_rcv
ndisc_rcv

Protocol handlers for ARP and ND packets, respectively.

ip_finish_output2
ip6_output_finish

Transmission functions for IPv4 and IPv6, respectively. See the section "Interaction Between Neighboring Protocols and L3 Transmission Functions" in Chapter 27.

neigh_hh_init

Initializes an hh_cache structure with an L2 header and binds it to the associated routing table cache entry. See the section "Link Between Routing and L2 Header Caching" in Chapter 27.

Variables

neigh_tables

List of registered protocols.

arp_tbl
nd_tbl
dn_neigh_table
clip_tbl

The four neigh_table structures that define the four neighboring protocols implemented in the kernel.

Data structures

struct neighbour
struct neigh_table
struct neigh_parms
struct neigh_ops
struct hh_cache
struct neigh_statistics

Main data structures, described in Chapter 27 and detailed in reference style in the section "Functions and Variables Featured in This Part of the Book."

Files and Directories Featured in This Part of the Book

Figure 29-5 shows the main files and directories referred to in the chapters on the neighboring subsystem.

Figure 29-5. Files and directories featured in this part of the book




[*] When a neighbour structure is put back into the memory pool by neigh_destroy, its content is not cleared.

[*] This statement is not 100% correct. Because a neigh_parms structure is used to tune the behavior of a neighboring protocol, its presence is needed only if there is at least one device whose L3 configuration uses the neighboring subsystem.

[*] With ND/IPv6, reachable_time can also be explicitly exchanged between routers and hosts using a field in the protocol header.

[*] A good illustration of the use of the hh_lock field can be found in neigh_destroy in net/core/neighbour.c. Here the lock is used to handle the case of a neighbour entry that cannot be removed because its reference count is nonzero.

Part VII. Routing

Layer three protocols, such as IP, must find out how to reach the system that is supposed to receive each packet. The recipient could be in the cubicle next door or halfway around the world. When more than one network is involved, the L3 layer is responsible for figuring out the most efficient route (so far as that is feasible) and for directing the message toward the next system along that route, also called the next hop. This process is called routing, and it plays a central role in the Linux networking code. Here is what is covered in each chapter:

Chapter 30 Routing: Concepts

Introduces the functionality that a basic router, and therefore the Linux kernel, must provide.

Chapter 31 Routing: Advanced

Introduces optional features the user can enable to configure routing in more complex scenarios. Among them we will see policy routing and multipath routing. We will also look at the other subsystems routing interacts with.

Chapter 32 Routing: Linux Implementation

Gives you an overview of the main data structures used by the routing code, describes the initialization of the routing subsystem, and shows the interactions between the routing subsystem and other kernel subsystems.

Chapter 33 Routing: The Routing Cache

Describes the routing cache, including the protocol-independent cache (destination cache, or DST). The description covers how elements are inserted into and deleted from the cache, along with the garbage collection and lookup algorithms.

Chapter 34 Routing: Routing Tables

Describes the structure of the routing table, and how routes are added to and deleted from it.

Chapter 35 Routing: Lookups

Describes the routing table lookups, for both ingress and egress traffic, with and without policy routing.

Chapter 36 Routing: Miscellaneous Topics

Concludes this part of the book with a detailed description of the data structures introduced in Chapter 32, and a description of the interfaces between user space and kernel. This includes a description of the old and new generations of administrative tools, namely the net-tools and IPROUTE2 packages.

Chapter 30. Routing: Concepts

Figure 30-1 shows where the routing subsystem (the gray box) fits into the network stack. The figure does not include all the details (Netfilter, bridging, etc.) but shows the other major kernel subsystems that are traversed before and after routing.

Figure 30-1. Relationship between the routing subsystem and the other main network subsystems

To explain some of the features or the details of their implementation, I'll often show snapshots of user-space configurations. You are encouraged to use Chapter 36 as a reference if you need to learn more about the user-space tools I employ in the examples.

The discussion on routing will focus on IPv4 networks. However, I will point out the aspects of IPv6 that differ significantly.

Routers, Routes, and Routing Tables

In its simplest form, a router can be defined as a network device that is equipped with more than one network interface card (NIC), and that uses its knowledge of the network to forward ingress traffic appropriately.[*]

The information required to decide whether an ingress packet is addressed to the local host or should be forwarded, together with the information needed to correctly forward the packets in the latter case, is stored in a database called the Forwarding Information Base (FIB). It is often referred to simply as the routing table.

Figure 30-2 shows a simple scenario with a LAN whose hosts are configured on the 10.0.0.0/24 subnet, and a router, RT, that is used by the hosts of the LAN to reach the Internet.

Figure 30-2. Basic example of router and routing table

Most hosts, not being routers, have only one interface. The host is configured to use a default gateway to reach any nonlocal addresses. Thus, in Figure 30-2, traffic for any host outside the 10.0.0.0/24 network (designated by 0.0.0.0/0) is sent to the gateway on 10.0.0.1. For hosts on the 10.0.0.0/24 network, the neighboring subsystem described in Part VI is used.

Regardless of the role played by a host in the network, each host maintains a routing table that it consults whenever it needs to handle network traffic, both when sending and receiving. Routers may need to run specialized software that is not usually needed by hosts, called routing protocols; after all, they need more knowledge about how to reach remote networks, and the nonrouter hosts depend on them for that. The routing protocols are beyond the scope of this book.

The routing capabilities required by hosts may be reduced even further under specific scenarios, such as the one described in the section "Proxy ARP Server as Router" in Chapter 28. In this chapter, however, we will stick to the common case just laid out.

The routing table is nothing but a collection of routes. A route is a collection of parameters used to store the information necessary to forward traffic toward a given destination. In Chapter 32, we will see in detail how Linux defines a route, but we can anticipate here the minimum set of parameters needed to define a route. Let's use Figure 30-2 again as a reference.

Destination network

The routing table is used to forward traffic toward its destination. It should not come as a surprise that this is the most important field used by the routing lookup routines. Figure 30-2 shows a routing table with two routes: one that leads to the local subnet 10.0.0.0/24 and another one that leads everywhere else. The latter is called the default route and is recorded as a network of all zeros in the table (see the section "Default Gateway Selection").

Egress device

This is the device out of which packets matching this route should be transmitted. For example, packets sent to the address 10.0.0.100 would be sent out eth0.

Next hop gateway

When the destination network is not directly connected to the local host, you need to rely on other routers to reach it. For example, the host in Figure 30-2 needs to rely on the router RT to reach any host located outside the 10.0.0.0/24 subnet. The next-hop gateway is the address of that router.
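
The three parameters above can be tied together in a small sketch: a route reduced to a destination/prefix, an egress device, and a next-hop gateway, plus a longest-prefix-match lookup in which the all-zeros default route matches only when nothing more specific does. This is an illustrative user-space model, not the kernel's FIB implementation; all names here are hypothetical.

```c
#include <stdint.h>

/* A route, reduced to the parameters described above. */
struct route {
    uint32_t dst;        /* destination network (host byte order)     */
    int      plen;       /* prefix length; 0 is the default route     */
    const char *dev;     /* egress device                             */
    uint32_t gateway;    /* next-hop gateway, 0 if directly connected */
};

#define IP(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8)  | (uint32_t)(d))

static uint32_t netmask(int plen)
{
    return plen ? ~(uint32_t)0 << (32 - plen) : 0;
}

/* Pick the most specific (longest-prefix) matching route, as a real
 * FIB lookup would; the default route matches when nothing else does. */
const struct route *lookup(const struct route *tbl, int n, uint32_t daddr)
{
    const struct route *best = 0;
    for (int i = 0; i < n; i++) {
        uint32_t mask = netmask(tbl[i].plen);
        if ((daddr & mask) == (tbl[i].dst & mask) &&
            (!best || tbl[i].plen > best->plen))
            best = &tbl[i];
    }
    return best;
}
```

With the two routes of Figure 30-2 loaded, a lookup for 10.0.0.100 selects the directly connected subnet route, while any outside address falls through to the default route via 10.0.0.1.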

Nonrouting Multihomed Hosts

Earlier, I said that a router usually has more than one NIC, given that its main job is to forward data received on one interface out to another. However, nonrouting hosts—especially servers—can also have multiple NICs without actually doing any packet forwarding. It is not uncommon for a big server to have multiple NICs for one or more of the following reasons:

High availability

If one interface goes down or fails, traffic can be taken over by a second one (which may be connected to a different LAN as well).

Greater routing capabilities

The server may be configured with more routes than just one default. For instance, it may use static routes or multiple NICs to reach specific hosts or subnets for particular reasons (for instance, to facilitate system logging). Figure 30-3 shows an example where a multihomed host has a second NIC connected to another LAN to let it reach Host A. Note that the multihomed host does not forward traffic between the two LANs; otherwise it would be a router by definition.

Channeling

It is possible to bind together multiple interfaces and make them look like a single one to the routing subsystem. This extra layer (which is transparent to the routing subsystem) can increase the overall bandwidth over a given connection, which can be a valuable feature for highly loaded servers.

Figure 30-3. Example of a multihomed host

In none of the preceding cases is the host considered a router, because it does not forward traffic from one interface to another. Another way to say this is that such a host never receives traffic addressed to any host but itself (where "itself" includes broadcast and multicast traffic), except in error or under very specific conditions (proxying, promiscuous interfaces, etc.). Multicast and broadcast traffic can be considered traffic addressed to the host.

Varieties of Routing Configurations

Routing is a complex topic; we will not be able to analyze all the possible scenarios, problems, and solutions. However, it is important to be aware of some of them to go through the source code and understand why some seemingly superfluous conditions are taken into consideration and handled specially.

Figure 30-4 shows three configurations you should understand to make sense of the design of the routing subsystem. The routers in these configurations are named Rn. Let's see what is so special about these cases:

  • (a) This is the most common case, where different interfaces are configured on different subnets, and each subnet is associated with a different LAN.

  • (b) Router RT has two interfaces on the same LAN (shown below the router), but they are configured on two different subnets.

  • (c) Router RT still has one address on each subnet 10.0.2.0/24 and 10.0.3.0/24, but both of those addresses have been configured on the same NIC. This can be accomplished in two different ways: by using the multiple IP address capability introduced with IPROUTE2, or by creating old-style aliasing interfaces. We will briefly compare the two approaches later in this chapter.

情况 (b) 和 (c) 并不常见,但它们完全合法,并显示了 Linux 和 IP 的灵活性。您可能还不清楚它们的含义。我们将在本章后面指出它们并证明它们的合理性,但让我们从一些简单的含义开始。

Cases (b) and (c) are not common, but they are perfectly legitimate and show how flexible Linux and IP are. Their implications may not be clear to you yet. We will point them out and justify them later in this chapter, but let's start with a couple of simple implications.

  • A LAN is a broadcast domain. All the hosts that belong to the same L2 broadcast domain receive each other's broadcasts. This means that in cases (b) and (c), if RT (or any other host in network 10.0.2.0/24) sends a packet to the broadcast address 10.0.2.255, all the hosts of subnet 10.0.3.0/24 will receive it (even though they will discard it), including, of course, RT.

  • The ingress interface is not necessarily different from the egress interface, although it usually is. Forwarding usually consists of receiving a packet on one interface and retransmitting it out to another one. In case (c), however, RT can receive a packet on one subnet and forward it to the other one on the same LAN using the same NIC.

In Chapter 26, we saw the implications of the setups in Figure 30-4(b) and 30-4(c) on lower-layer neighboring protocols. In this chapter, we will look at the implications with regard to routing.

Questions Answered in This Part of the Book

At this point, you may be asking yourself general questions such as:

  • If a router is supposed to forward packets, how does the kernel know that forwarding is enabled?

  • Is routing something you enable globally or between interface pairs?

  • Are there tuning parameters that can significantly influence the performance of a Linux router?

  • What is the syntax of the routing table?

Or more specific ones such as:

  • What is the algorithm used to find the information needed to forward a packet?

  • Is the routing table used only to forward traffic, or is there any other use for it?

  • How does the kernel interact with dynamic routing protocol daemons running in user space?

With this and the following routing chapters, you'll be able to answer both kinds of questions.

Essential Elements of Routing

In this section, I'll introduce some terms and basic elements of the routing landscape. It's important to have a clear understanding of the meanings of a few key terms that are used extensively in this part of the book, and that appear as part of the variable and function names in the associated kernel code. Fortunately, the routing code uses naming conventions pretty consistently.

Figure 30-4. Examples of network topologies.

A few definitions are simple and are shown in the following list. Other concepts are presented in their own subsections.

Internet Service Provider (ISP)

Company or organization that provides access to the Internet.

Forwarding Information Base (FIB)

This is simply the routing table. See the earlier section "Routers, Routes, and Routing Tables."

Symmetric routes and asymmetric routes

Usually, the route taken from Host A to Host B is the same as the route used to get back from Host B to Host A; the route is then called symmetric . In complex setups, the route back may be different; in this case, it is asymmetric.

Metrics

A metric is an optional parameter that can be configured on a route. Do not confuse these metrics with the ones used by routing protocols: the latter use metrics to quantify how good a route is. Examples of routing protocol metrics are the end-to-end delay, the number of hops, a configuration weight or cost, etc.

When you configure a route with IPROUTE2, you can provide additional parameters called metrics, as defined in the section "Essential Elements of Routing." One of them—Path Maximum Transmission Unit, or Path MTU—is described in Chapter 18. Others are used by the Transmission Control Protocol (TCP) as starting values for internal variables that may later be adjusted by the protocol. You can refer to any book on TCP for their meaning and use:

  • Window

  • Round-trip time

  • Round-trip time variation

  • Slow-start threshold

  • Congestion window

  • Maximum segment size to advertise

  • Reordering

Realm

A numerical domain identifier. See the section "Routing Table Based Classifier" in Chapter 31.

Address class

IP addresses are classified into various classes, shown in Table 30-1. Table 30-2 shows, for each class of IP addresses, the size of the network and host components (note that classes D and E are special cases of class C).

Table 30-1. Classification of IPv4 addresses based on class

Class          First address  Last address     Leftmost bits of addresses
A              0.0.0.0        127.255.255.255  0----
B              128.0.0.0      191.255.255.255  10---
C              192.0.0.0      223.255.255.255  110--
D (Multicast)  224.0.0.0      239.255.255.255  1110-
E (Reserved)   240.0.0.0      255.255.255.255  1111-

Table 30-2. Network and host components

Class  Size of network address component  Size of host address component  Number of hosts (including network and broadcast addresses)
A      8                                  24                              16,777,216 (2^24)
B      16                                 16                              65,536 (2^16)
C      24                                 8                               256 (2^8)
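The class boundaries in Tables 30-1 and 30-2 follow directly from the leftmost bits of the first octet. As a quick illustration (a standalone sketch, not kernel code), the classification can be expressed in a few lines of Python:

```python
def ipv4_class(address: str) -> str:
    """Classify a dotted-quad IPv4 address per Table 30-1,
    by looking at the leftmost bits of the first octet."""
    first = int(address.split(".")[0])
    if first < 128:          # leftmost bit  0----
        return "A"
    if first < 192:          # leftmost bits 10---
        return "B"
    if first < 224:          # leftmost bits 110--
        return "C"
    if first < 240:          # leftmost bits 1110-
        return "D (Multicast)"
    return "E (Reserved)"    # leftmost bits 1111-

if __name__ == "__main__":
    for addr in ("10.0.0.1", "172.16.0.1", "192.168.0.1", "224.0.0.5"):
        print(addr, "->", ipv4_class(addr))
```

Classful addressing has long been superseded by CIDR, but as the text notes, the kernel still uses the class (for example) to pick a default netmask when none is given.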

Routable and nonroutable addresses

The IP specifications have set aside certain ranges of addresses (shown in Table 30-3) as nonroutable, which means they are reserved for use on a LAN. Routable addresses must be handed out by centralized bodies and are unique worldwide. Anyone, in contrast, can configure nonroutable addresses , and these are the ones most users have on their systems behind their routers. Nonroutable addresses cannot be used to provide any Internet service because they are not unique and Internet routers are not supposed to pass traffic to them.

The 127.0.0.0/8 subnet is a special range of addresses whose scope[*] is just the host where they are configured. No packet can leave a host with one of these addresses as either the source or the destination.

Table 30-3. Nonroutable and loopback IPv4 addresses

Addresses                        Class
10.0.0.0/8                       1 x Class A
172.16.0.0/16 to 172.31.0.0/16   16 x Class B
192.168.0.0/16                   256 x Class C
127.0.0.0/8 (Loopback)           1 x Class A

Figure 30-5 shows a topology with two subnets using the same range of nonroutable IP addresses 10.0.1.0/24, and one subnet using the routable subnet 100.0.1.0/24. For hosts from either 10.0.1.0/24 subnet to communicate with hosts outside their subnet, their routers must use some form of Network Address Translation (NAT) to hide the local, nonroutable subnets. Note also that each host is configured by default with the 127.0.0.1 address. The interfaces that connect the three routers to their ISPs are configured with routable IP addresses assigned by the ISPs.

Figure 30-5. Routable versus nonroutable addresses
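The nonroutable ranges of Table 30-3 match what Python's standard ipaddress module calls "private" addresses, so the distinction in Figure 30-5 can be checked programmatically. A small sketch (the specific addresses are just examples taken from the figure):

```python
import ipaddress

# RFC 1918 (nonroutable) addresses report is_private, the loopback
# range reports is_loopback, and a routable address reports neither.
for addr in ("10.0.1.1", "172.20.0.1", "192.168.1.1",
             "127.0.0.1", "100.0.1.1"):
    ip = ipaddress.ip_address(addr)
    print(f"{addr}: private={ip.is_private} loopback={ip.is_loopback}")
```

Note that 100.0.1.1, from the routable subnet in the figure, is classified as neither private nor loopback.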

Scope

Both routes and IP addresses are assigned scopes, which tell the kernel the contexts in which they are meaningful and usable. If you understand the concept of scope, you will have an easier time understanding the various sanity checks done by the routing code, and the distinctions it makes between differently scoped routes and IP addresses.

The scope of a route in Linux is an indicator of the distance to the destination network. The scope of an IP address is an indication of how far from the local host the address is known, which, to some extent, also tells you how far the owner of that address is from the local host.

Chapter 32 offers a more detailed list of scopes, but let's see a few examples here, using a terminology very similar to the one used in the code so that it will be easier to associate the code with these concepts.

Let's start with common scopes for IP addresses:

Host

An address has host scope when it is used only to communicate within the host itself. Outside the host this address is not known and cannot be used. An example is the loopback address, 127.0.0.1.

Link

An address has link scope when it is meaningful and can be used only within a LAN (that is, a network on which every computer is connected to every other one on the link layer). An example is a subnet's broadcast address. Packets sent to the subnet broadcast address are sent by a host on that subnet to the other hosts on the same subnet.[*]

Universe

An address has universe scope when it can be used anywhere. This is the default scope for most addresses.

Note that the scope does not reflect the distinction between nonroutable (private) and routable (public) addresses. Both 10.0.0.1 (which is nonroutable) and 165.12.12.1 (which is routable) can be given either link or universe scope. The scope is assigned by the system administrator when she configures the addresses (or is assigned a default value by the configuration commands). Since universe scope is the default for both of the addresses mentioned, the administrator must explicitly specify a scope if something different is desired. The broadcast and loopback addresses are assigned the proper scope automatically by the kernel.

Let's see now the meaning of the same three scopes when applied to routes:

Host

A route has host scope when it leads to a destination address on the local host.

Link

A route has link scope when it leads to a destination address on the local network.

Universe

A route has universe scope when it leads to addresses more than one hop away.

We will see in the section "Adding an IP address" in Chapter 32 that Linux creates a route for each local address configured, plus one for the broadcast address of each configured subnet. That section should help you understand the relationship between the scopes of addresses and of routes.

Use of the scope

The scope of both addresses and routes is used extensively by the routing code and other parts of the kernel.

First of all, remember that in Linux, even though an administrator configures IP addresses on interfaces, addresses belong to the host, not to the interfaces. See the section "Responding from Multiple Interfaces" in Chapter 28 for more details.

It is not uncommon for a host to be configured with multiple addresses , either on a single interface or on multiple interfaces. When the local system transmits a packet, the kernel needs to select what source IP address to use. This is trivial when the host has only one NIC with a single IP address configured, but it is less obvious when you run a complex setup with multiple addresses of different scopes. Depending on the location of the destination address, you may prefer to select a source IP address with a specific scope, which the destination can then use to return traffic or for other purposes at the remote site.

The routing code also uses scopes to enforce simple yet powerful sanity checks on the configuration. Suppose you need to transmit a packet to remote Host B, which is not directly reachable in any of the subnets configured on the local host. A routing lookup will return you the address of the gateway to use—say, RT. Now you know that to reach Host B, you need to send your packet to RT, which will take care of forwarding it. To avoid a loop, RT must be closer to the destination than you are. In other words, the scope of the route to Host B must be wider than the scope of the route toward RT. (There are exceptions, which are often required by special configurations.)

Let's look at an example using the topology of Figure 30-6. For Host A to reach Host B, a routing lookup on the former returns the default route via 10.0.1.1, whose scope is RT_SCOPE_UNIVERSE. The gateway's address 10.0.1.1 is reachable directly via A's eth0 interface, according to the other route shown in the figure. This second route has scope RT_SCOPE_LINK, which is narrower than the previous scope and therefore enables the interface to be used to send the packet to the address with the broader scope.

Figure 30-6. Simple network topology
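The kernel encodes scopes numerically (RT_SCOPE_UNIVERSE is 0, RT_SCOPE_LINK is 253, RT_SCOPE_HOST is 254), with larger values meaning narrower scopes. The sanity check just described can be sketched as a comparison on those values; this is a simplified model of the idea, not the kernel's actual FIB code:

```python
# Scope values as defined by the kernel's rtnetlink interface:
# a LARGER number means a NARROWER scope.
RT_SCOPE_UNIVERSE = 0
RT_SCOPE_LINK = 253
RT_SCOPE_HOST = 254

def gateway_scope_ok(route_scope: int, gateway_route_scope: int) -> bool:
    """The route toward the gateway must be narrower in scope than the
    route that uses that gateway, so the gateway is closer to the
    destination than we are (exceptions exist for special setups)."""
    return gateway_route_scope > route_scope

# Host A's default route (universe scope) via 10.0.1.1, which is itself
# reachable through a link-scope route on eth0: a legal combination.
assert gateway_scope_ok(RT_SCOPE_UNIVERSE, RT_SCOPE_LINK)
# The reverse would allow a loop and is rejected.
assert not gateway_scope_ok(RT_SCOPE_LINK, RT_SCOPE_UNIVERSE)
```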

In the section "Egress lookup" in Chapter 33, you can find an example of using scope involving ARP.

Default Gateway

The default gateway , often referred to as the 0.0.0.0/0 route, is the one used when there is no explicit route to a destination.[*] A single host connected to the Internet is usually configured with a route to the local network (which is indirectly derived from the NIC's configuration) and one default route (usually given by the ISP) that is used to reach the Internet. A router, on the other hand, may or may not be configured with default routes; it depends on where the router is placed in the network topology and what role the router plays.

The Linux kernel does not have any restriction on the number of default gateways you can configure. See Chapter 35 for details.
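The reason the 0.0.0.0/0 route acts as a catch-all is that route selection uses longest-prefix matching: the default route matches every destination but, having prefix length zero, loses to any more specific entry. A toy lookup illustrating this (the routing table contents and next hops are hypothetical):

```python
import ipaddress

# A toy routing table: (prefix, next hop). The 0.0.0.0/0 entry is the
# default route; the next-hop addresses are made up for illustration.
routes = [
    ("10.0.1.0/24", "direct"),
    ("10.0.0.0/8", "10.0.1.254"),
    ("0.0.0.0/0", "10.0.1.1"),   # default gateway
]

def lookup(destination: str) -> str:
    """Longest-prefix match: among the entries whose prefix contains
    the destination, pick the one with the longest netmask. 0.0.0.0/0
    matches everything, so it wins only as a last resort."""
    dst = ipaddress.ip_address(destination)
    matches = [(ipaddress.ip_network(p), nh) for p, nh in routes
               if dst in ipaddress.ip_network(p)]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

assert lookup("10.0.1.7") == "direct"        # most specific /24 wins
assert lookup("10.9.9.9") == "10.0.1.254"    # falls back to the /8
assert lookup("198.51.100.1") == "10.0.1.1"  # only the default matches
```

The real lookup algorithm used by the kernel is covered in Chapter 34; this sketch only shows the matching semantics.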

Directed Broadcasts

A broadcast packet is simply a packet sent to a subnet's broadcast address. Subnet broadcasts are usually generated by hosts located within the same subnet. This means that the broadcast packet is addressed to all the hosts in its own subnet.

A directed broadcast, on the other hand, is addressed to the broadcast address of a remote subnet. An example of the use of a directed broadcast is the remote announce feature used by SAMBA servers to advertise resources (printers or folders) on remote subnets.

Let's illustrate directed broadcasts by referring to Figure 30-7(a). When Host A sends an IP packet to the address 10.0.0.255, it generates a local broadcast (this case is not in the figure). When it sends a packet to the address 10.0.1.255 instead, it generates a directed broadcast. In our example, Host A and the destination network (10.0.1.0/24) are separated by only one hop, but they could have been more distant; the definition of a directed broadcast includes any case where the sender does not belong to the local network to which it addresses the broadcast.

A host can identify a directed broadcast only if the host is on the subnet to which the broadcast is directed. For example, in Figure 30-7(c), RT1 cannot tell whether 100.0.1.127 is a subnet broadcast, but RT2 can. Directed broadcasts can therefore be recognized as such only by the last gateway on the path to the destination subnet, because that gateway has one IP address configured on that subnet.
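This ambiguity can be made concrete with a short sketch. Assuming, as the figure suggests, that the remote subnet is 100.0.1.0/25 (so that .127 is its broadcast address), a host can recognize the directed broadcast only if that subnet appears in its own configuration:

```python
import ipaddress

def classify(destination: str, local_subnets: list) -> str:
    """A host recognizes a subnet broadcast only for subnets it is
    configured on; elsewhere the same address looks like an ordinary
    unicast address (cf. RT1 vs. RT2 and 100.0.1.127)."""
    dst = ipaddress.ip_address(destination)
    for subnet in local_subnets:
        if dst == ipaddress.ip_network(subnet).broadcast_address:
            return "subnet broadcast"
    return "looks like unicast here"

# RT2 has an address on 100.0.1.0/25, so it recognizes the broadcast;
# RT1, configured on other subnets, cannot tell.
assert classify("100.0.1.127", ["100.0.1.0/25"]) == "subnet broadcast"
assert classify("100.0.1.127", ["10.0.0.0/24"]) == "looks like unicast here"
```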

You may wonder why this is important. Essentially, a misuse of directed broadcasts can generate Denial of Service (DoS) attacks, and it is unfortunately difficult to distinguish benign directed broadcasts from malign ones. One case, however, is known to probably be malign: ICMP ECHO REQUEST packets (i.e., pings) sent as directed broadcasts.

Let's take the example in Figure 30-7(a) and (b) and imagine 10.0.0.200 sending an ICMP ECHO REQUEST to the broadcast address 10.0.1.255 (which is not its own subnet), using the source IP address 10.0.0.100 (which is not its own address). This would make each host in the remote subnet 10.0.1.0/24 reply to the ICMP ECHO REQUEST by sending an ICMP ECHO REPLY to the victim host with IP address 10.0.0.100. The latter would simply discard those packets, since it never sent any ICMP ECHO REQUEST, but even discarding a huge number of packets is CPU-consuming. As you can imagine, if there were a lot of hosts in the 10.0.1.0/24 subnet, the victim could be flooded with garbage traffic.

The routing subsystem of the Linux kernel does not allow you to discard any directed broadcasts. (You can, however, use the filtering subsystem to weed them out.) Linux does handle ICMP ECHO REQUESTS addressed to a broadcast address specially: the administrator can indicate whether a host should reply to an ICMP ECHO REQUEST when the destination address is a local subnet broadcast.

Primary and Secondary Addresses

Sometimes it's necessary to configure multiple IP addresses on the same NIC. This may be required, for instance, because:

  • You run multiple services on the same host and you prefer to advertise each service with a different IP address. This can also simplify the firewall rules.

  • You may be short of hardware and forced to temporarily merge two subnets onto the same hub or switch. In that case, a single NIC would be sufficient to provide connectivity to both subnets.

You may be surprised to hear that, when you configure multiple IP addresses on the same NIC, the kernel's routing code may not consider them equivalent even when they are assigned the same scope. Distinctions can be made by calling some addresses primary and others secondary.

When you configure an IP address on an interface, you are always required to provide a netmask as well. When you do not provide it and the system does not complain, it means the system has selected a default netmask for you. (It may, for example, be based on the class the IP address belongs to. See the section "Essential Elements of Routing.") Without a netmask, the routing subsystem wouldn't know which addresses are directly reachable through that interface. So every address is accompanied by a netmask, and if you configure multiple IP addresses on the same interface, you need to specify a netmask for each one. Those netmasks may or may not be the same, depending on the configuration you want to enforce.

Figure 30-7. Examples of malign directed broadcasts

An address is considered secondary if it falls within the subnet of another already configured address on the same NIC. This includes the case where the subnets are the same. Thus, the order in which addresses are configured is important: you do not explicitly say that a given address is primary or secondary when you configure it, but the decision is made automatically based on the presence of an existing address encompassing the subnet.

Let's see a couple of examples.

The following is the configuration of a single NIC named eth0 after it is configured with two addresses having the same netmask, first 10.0.0.1/24 and then 10.0.0.2/24. Since the two addresses fall within the same 10.0.0.0/24 subnet, the first one configured will be primary and the other one will be secondary.

[root@router kernel]# ip address add 10.0.0.1/24 broadcast 10.0.0.255 dev eth0
[root@router kernel]# ip address list dev eth0
4: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:60:97:77:d1:8c brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/24 brd 10.0.0.255 scope global eth0
 
[root@router kernel]# ip address add 10.0.0.2/24 broadcast 10.0.0.255 dev eth0
[root@router kernel]# ip address list dev eth0
4: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:60:97:77:d1:8c brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/24 brd 10.0.0.255 scope global eth0
    inet 10.0.0.2/24 brd 10.0.0.255 scope global secondary eth0

Each interface can have as many primary and secondary addresses as you like. For a particular netmask (the /24 netmask in this case), only one address can be primary. If we added a third address—say, 10.0.0.3/24—it would be classified as a secondary address associated with the primary address 10.0.0.1/24.

On the other hand, 10.0.0.1/24 and 10.0.0.3/25 are on different subnets (because of the different netmasks) even though they cover an overlapping range of addresses. Therefore, if we added the 10.0.0.3/25 address to the previous two, it would be classified as another primary address on eth0. This would be the output of ip address list:

[root@router kernel]# ip address add 10.0.0.3/25 broadcast 10.0.0.127 dev eth0
[root@router kernel]# ip address list dev eth0
4: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:60:97:77:d1:8c brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/24 brd 10.0.0.255 scope global eth0
    inet 10.0.0.2/24 brd 10.0.0.255 scope global secondary eth0
    inet 10.0.0.3/25 brd 10.0.0.127 scope global eth0

In the section "Helper Routines" in Chapter 35 we will see how the kernel manages to select one IP address when there are multiple primary addresses with overlapping subnets.

In short, it is not only the IP address that decides the primary-secondary status: you also need to take into account the netmask because it identifies the subnet. When configuring multiple IP addresses on an interface, it is important to understand the difference between primary and secondary addresses. It is also important when looking at the routing code. We will see in Chapter 32 that the response to many events and conditions depends on whether the IP address is primary or secondary. Here are some examples:

  • Primary addresses contribute to the entropy of the CPU that happens to run the code that applies the configuration.

  • When you delete a primary address, all the associated secondary addresses are also removed. There is an option, configurable via /proc, that allows secondary addresses to be promoted to primary when the current primary address is removed (see Chapter 18).

  • When a host selects the source IP address for locally generated traffic, it considers only primary addresses.
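The primary/secondary rule illustrated by the ip address sessions above can be modeled in a few lines. This is a simplified sketch of the decision, not the kernel's actual in-device address code; the classify helper and the address list are invented for illustration:

```python
import ipaddress

def classify(new_addr: str, configured: list) -> str:
    """Sketch of the rule: an address is secondary if it falls within
    the subnet of an address already configured on the same NIC. The
    netmask matters: 10.0.0.3/25 is NOT in the 10.0.0.0/24 subnet in
    this sense, because the masks (and hence the subnets) differ."""
    new = ipaddress.ip_interface(new_addr)
    for existing in configured:
        if new.network == ipaddress.ip_interface(existing).network:
            return "secondary"
    return "primary"

eth0 = []  # addresses configured so far, in configuration order
for addr in ("10.0.0.1/24", "10.0.0.2/24", "10.0.0.3/25"):
    print(addr, "->", classify(addr, eth0))
    eth0.append(addr)
```

This reproduces the earlier examples: 10.0.0.1/24 is primary, 10.0.0.2/24 becomes secondary to it, and 10.0.0.3/25 is primary because its netmask places it in a different subnet.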

Old-generation configuration: aliasing interfaces

You may have noticed that in the previous sections, I always used the ip address command to configure addresses. That's for a good reason: ifconfig, the old-generation interface configuration command and the most common tool used by Unix administrators for this purpose, cannot distinguish between primary and secondary addresses. ifconfig does not even show the secondary addresses, so the output of ifconfig and ip address list would not match in the examples in the previous sections. Chapter 36 offers a deeper comparison between ifconfig, offered by Linux's net-tools package, and the new-generation ip address tool, offered by the IPROUTE2 package.

Before the introduction of IPROUTE2 and its advanced routing capabilities, Linux used the concept of aliasing interfaces, which is still available with newer kernels for backward compatibility. The only way to configure multiple addresses on a single NIC with ifconfig was to define virtual devices like eth0:0, eth0:1, etc. Each virtual device could be used as a real NIC: you could configure an address on it, use it as a device when configuring routing, and so on.

Relationship between aliasing devices and primary/secondary status

Because the kernel supports both the advanced capabilities of IPROUTE2 and old-style aliasing, those two models need to coexist somehow. We will see the details of the kernel internals in Chapter 32, but we can examine here how the two coexist from a user-space perspective.

When you configure an aliasing device, the primary/secondary status is still assigned based on the same rule we introduced in the section "Primary and Secondary Addresses." However, the output of ip address list now adds a reference to the aliasing device. The following snapshot shows an example where we start with an interface with one configured address (eth1), add an address within the same subnet on an aliasing device (eth1:1), and then add another address in a different subnet on another aliasing device (eth1:2). Because of the differing subnets, eth1:1 becomes secondary and eth1:2 becomes primary.

[root@router kernel]# ip address list
...
11: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0a:41:04:bd:16 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.101/24 brd 192.168.1.255 scope global eth1
...
[root@router kernel]# ifconfig eth1:1 192.168.1.102 netmask 255.255.255.0
[root@router kernel]# ifconfig eth1:2 192.168.1.103 netmask 255.255.255.128
[root@router kernel]# ip address list
...
11: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0a:41:04:bd:16 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.101/24 brd 192.168.1.255 scope global eth1
    inet 192.168.1.103/25 brd 192.168.1.255 scope global eth1:2
    inet 192.168.1.102/24 brd 192.168.1.255 scope global secondary eth1:1
...

An obvious question is whether you can configure multiple addresses on an aliasing device using IPROUTE2. This is not possible, because IPROUTE2 does not treat aliasing devices as real, independent devices as ifconfig does: an aliasing device to IPROUTE2 is just a label on an address.

[root@router kernel]# ip address add 192.168.1.104/24 dev eth1:1
Cannot find device "eth1:1"

Routing Table

The routing table is the core of the routing subsystem. In its simplest definition, it consists of a database of routes that is available to other subsystems—IPv4, for example—through various functions, the most important being the one used to do lookups.

As you may already imagine, routes do not consist only of the basic information shown in the section "Routers, Routes, and Routing Tables." Over time, due both to code optimizations and to the introduction of new features, the amount of information that makes up an entry in the routing table has grown quite a bit. We will look at those details in Chapter 34.

In the following subsections, we will briefly see:

  • How Linux routes packets addressed to local addresses

  • What algorithm is used to look up addresses in the routing table

  • What administrative actions can be applied to traffic matching a route besides the default forwarding action

  • What extra information is stored in a route by upper protocols for their convenience

Special Routes

When a packet is received, a router needs to determine whether to deliver it locally to the next-higher layer (because the local host is the final destination) or to forward it. A simple way to accomplish this is to store all the local addresses in a list and scan the list for each packet as part of the routing lookup. Of course, a list would not be the best choice; there are better data structures that can provide faster lookup time. Linux uses a separate hash-based routing table where it stores only local addresses. To be more exact, it stores all of those addresses that it listens to, which includes both the locally configured addresses and the subnet broadcasts.
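As a purely illustrative sketch (not the kernel's actual data structures), the idea of a dedicated hash-based table for local addresses can be modeled with a Python set: membership tests are constant-time, and both the configured address and its subnet broadcast are registered.

```python
# Hypothetical model of the kernel's local routing table: a hash-based
# set of every address the host listens to (configured addresses plus
# subnet broadcasts), so "deliver locally or forward?" is a single
# constant-time membership test instead of a list scan.
import ipaddress

class LocalAddressTable:
    def __init__(self):
        self._addrs = set()   # hash-based storage

    def add_interface_address(self, cidr):
        """Register a configured address and its subnet broadcast."""
        iface = ipaddress.ip_interface(cidr)
        self._addrs.add(iface.ip)
        self._addrs.add(iface.network.broadcast_address)

    def is_local(self, addr):
        return ipaddress.ip_address(addr) in self._addrs

table = LocalAddressTable()
table.add_interface_address("192.168.1.101/24")

print(table.is_local("192.168.1.101"))   # True: configured address
print(table.is_local("192.168.1.255"))   # True: subnet broadcast
print(table.is_local("192.168.1.50"))    # False: forward instead
```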

This means that by default, Linux uses two routing tables:

  • A table for local addresses. A successful lookup in this table means that the packet is to be delivered on the host itself. We will see in Chapter 32 how the kernel populates this table.

  • A table for all other routes, manually configured by the user or dynamically inserted by routing protocols.

Route Types and Actions

We saw in the section "Routers, Routes, and Routing Tables" what a basic route consists of. By default, the action taken to process a packet that matches a given route is to forward it according to the forwarding information returned from the routing table for that route: the next-hop router and the egress device.

However, Linux allows you to optionally define other kinds of actions as well.[*] Here are the main ones:

Black hole

Packets matching this type of route are silently discarded.

Unreachable

Packets matching this type of route are discarded and generate an Internet Control Message Protocol (ICMP) host unreachable message.

Prohibit

Packets matching this type of route are discarded and generate an ICMP packet filtered message.

Throw

This type is used in conjunction with policy routing, a feature covered in Chapter 31. When policy routing is configured, a matching route of this type will make the lookup abandon the current table and continue with the following one (if any).
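These actions can be pictured with a short Python sketch; the constant names and return values below are purely illustrative and have nothing to do with the kernel's internal interface.

```python
# Illustrative dispatch over the special route types described above.
# Each action yields (packet fate, ICMP message generated, try next table?).
BLACKHOLE, UNREACHABLE, PROHIBIT, THROW, FORWARD = range(5)

def apply_route_action(action):
    if action == BLACKHOLE:
        return ("dropped", None, False)                # silently discarded
    if action == UNREACHABLE:
        return ("dropped", "host unreachable", False)  # ICMP sent back
    if action == PROHIBIT:
        return ("dropped", "packet filtered", False)   # ICMP sent back
    if action == THROW:
        # policy routing only: abandon this table, continue with the next
        return (None, None, True)
    return ("forwarded", None, False)                  # default action

print(apply_route_action(BLACKHOLE))     # ('dropped', None, False)
print(apply_route_action(UNREACHABLE))   # ('dropped', 'host unreachable', False)
```

With IPROUTE2, such routes are configured with commands along the lines of ip route add blackhole 10.0.2.0/24 or ip route add unreachable 10.0.3.0/24 (the prefixes here are just examples).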

Routing Cache

Depending on the role played by the router, the number of routes in its routing table can range from a few units to a few hundred thousand. Because of that, it should be obvious that it would be beneficial to maintain a smaller table that caches the results of lookups, both positive and negative. Linux splits the routing cache into two components (where a protocol, in this context, means an L3 protocol such as IPv4 and IPv6):

  • A protocol-dependent cache

  • A protocol-independent destination cache, often called just DST

The first component represents the skeleton of the cache, where each element is defined as a collection of protocol-specific fields. The second component, which is embedded in the first, stores only protocol-independent information. Both the protocol-dependent cache and the protocol-independent component of it are described in Chapter 33.

We will see in Chapter 31 that it is possible to create multiple independent routing tables on a Linux system that supports the policy routing feature. Regardless of the number of routing tables, Linux uses only one routing cache. If policy routing is supported, the cache does not provide any fairness, so it is possible that the routes of one routing table use many more entries of the cache than other routing tables (i.e., the space in the cache is not equally distributed among the routing tables). This approach, however, ensures greater routing throughput overall.

Routing Table Versus Routing Cache

The routing table and the routing cache, besides differing in size and structure, also differ in the granularity of their objects. The routing table uses subnets, aggregates of consecutive addresses. Entries of the cache, on the other hand, are associated with single IP addresses. Because of this, the lookup algorithm used by the routing table and the routing cache also differs, as we will see in the section "Lookups."

Let's view an example. Suppose our routing table includes, among other routes, the one in Table 30-4, which is the only one that leads to the subnet 10.0.1.0/24.

Table 30-4. Example of routing table entry

Destination     Next hop     Device to use
10.0.1.0/24     10.0.0.1     eth0

Let's also suppose the kernel was asked to transmit two packets to the addresses 10.0.1.100 and 10.0.1.101, respectively. Since the route in Table 30-4 would match in both cases, the kernel would use it to route the two packets and would install two entries into the routing cache that would look like those in Table 30-5.

Table 30-5. Example of routing cache entry

Destination     Next hop     Device to use
10.0.1.100      10.0.0.1     eth0
10.0.1.101      10.0.0.1     eth0

The elements in Table 30-5 are a simplified version, of course. In Chapter 33, we will see that entries of the routing cache include the source address, too.
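The different granularity of the two structures can be illustrated with a small Python model mirroring Tables 30-4 and 30-5 (a sketch, not kernel code): the routing table holds one subnet route, while the cache gains one exact-match entry per destination actually routed.

```python
# Sketch of table vs. cache granularity: the table is keyed by subnet,
# the cache by individual destination address.
import ipaddress

routing_table = {ipaddress.ip_network("10.0.1.0/24"): ("10.0.0.1", "eth0")}
routing_cache = {}   # keyed by single IP address

def route(dst):
    dst = ipaddress.ip_address(dst)
    if dst in routing_cache:                        # fast path: exact match
        return routing_cache[dst]
    for subnet, info in routing_table.items():      # slow path: table lookup
        if dst in subnet:
            routing_cache[dst] = info               # install a cache entry
            return info
    return None

route("10.0.1.100")
route("10.0.1.101")
print(len(routing_table), len(routing_cache))   # 1 2: one route, two cache entries
```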

Routing Cache Garbage Collection

Garbage collection is responsible for eliminating data structures, owned by the routing subsystem, that are no longer in use. However, data structures may be removed even if they are in use, for example, to free the memory needed to store something more important. The deletions done by garbage collection do not lead to any loss of data, because all the eliminated information can be re-created; in the worst case, the deletion of an element from the cache leads only to a cache miss.

There are two kinds of garbage collection:

Synchronous

When the routing subsystem sees the need to free some memory, a cleanup is done right away. There are two cases where the routing code may force garbage collection without waiting for the regular timer to do it:

  • When a new entry is to be added to the routing cache and the number of entries currently in the cache has reached a particular threshold, which is configurable by the user.

  • When memory is needed by the neighboring subsystem cache. We saw in Chapter 27 that the routing cache and the neighboring subsystem cache keep references to each other. The creation of a new routing cache entry could trigger the creation of a new neighbor cache entry. If the neighboring protocol—say, ARP—failed to allocate the memory it needed, the routing subsystem would force a garbage collection to indirectly free data structures owned by the neighboring protocol and therefore help the latter find the memory it needed.

Asynchronous

To keep the cache size reasonable, a periodic timer is used to trigger regular cleanups. By default, routing cache entries do not expire. However, it is possible for external subsystems to tell the routing cache to expire certain entries after a given amount of time. The routing subsystem runs a timer that periodically scans the cache, looking for entries that:

  • Are expired and should be removed

  • Are not expired, but could be sacrificed if the kernel needs to free some memory
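A sketch of the periodic scan might look like the following (the field names are hypothetical): expired entries are always removed, valid but unreferenced entries are removed only when memory is needed, and entries with a non-null reference count are never eligible.

```python
# Illustrative model of the asynchronous garbage-collection scan.
def gc_scan(cache, now, need_memory=False):
    survivors = {}
    for dst, entry in cache.items():
        expired = entry["expires"] is not None and entry["expires"] <= now
        # a valid entry can be sacrificed only if memory is needed and
        # nothing holds a reference to it
        victim = need_memory and entry["refcnt"] == 0
        if not expired and not victim:
            survivors[dst] = entry
    return survivors

cache = {
    "10.0.1.100": {"expires": 100, "refcnt": 0},    # expired by t=200
    "10.0.1.101": {"expires": None, "refcnt": 0},   # valid, but sacrificable
    "10.0.1.102": {"expires": None, "refcnt": 2},   # in use: never removed
}
print(sorted(gc_scan(cache, now=200)))                    # ['10.0.1.101', '10.0.1.102']
print(sorted(gc_scan(cache, now=200, need_memory=True)))  # ['10.0.1.102']
```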

Examples of events that can expire cache entries

An entry is set to expire only in specific cases, including:

  • When the local system receives an ICMP UNREACHABLE or ICMP FRAGMENTATION NEEDED message, it hands it to the ICMP layer. Such a message notifies the local host about a packet that it previously sent out whose size exceeded the MTU of a router along the path to the destination address. The ICMP handler will scan the routing cache, update the PMTU field of all the affected entries, and set the latter to expire after a certain configurable amount of time, which is 10 minutes by default.

    ICMP also notifies the L4 protocol associated with the packet that triggered the ICMP message. For instance, TCP may use these notifications for the Path MTU discovery algorithm. See Chapters 18 and 25 for more details on path MTU discovery.

  • A destination IP address can be classified as unreachable when the neighboring protocol has failed to resolve the L3-to-L2 mapping (see Chapter 27) or when the local host is at one end of an IP tunnel, and the other end becomes unreachable for some reason (for example, a routing problem or misconfiguration).

    When a destination IP address is classified as unreachable, all the entries of the cache associated with the address need to be flushed and therefore will be set to expire right away.

Examples of eligible cache victims

There may be cases where the kernel needs to free some cache entries to make room for new ones, and the periodic timer is not able to guarantee by itself that the cache will always have some free room (i.e., to keep its size below some threshold). In those cases, the host must delete entries that the periodic timer would not pick because they are still valid. Even if the garbage collection system needs to select victims from valid entries, it can reduce the damage by selecting those that can be re-created quickly with only a small overhead.

Good candidates for removal include routes to broadcast and multicast addresses. Normally, when the routing subsystem deletes a routing cache entry, it may indirectly remove the L3-to-L2 association as well. When this happens, the next time the host needs to send data to the L3 address, the neighboring subsystem will need to resolve the L3-to-L2 association again. However, broadcast and multicast addresses can be resolved with low overhead because they do not need any solicitation request (see the section "Special Cases" in Chapter 26).

Particularly bad (high-overhead) candidates for removal include:

REDIRECT routes

This kind of route has been learned through an ICMP REDIRECT message; if it is removed, the host will use suboptimal routing for further traffic along that route. Removing the entry may also be a waste of time because the host will most likely receive another ICMP REDIRECT that just leads to reinserting the route.

Routes manually configured by the administrator

These are routes for which a user, via a command such as ip route get 10.0.0.1 monitor, has asked the kernel to send a notification (via the Netlink socket) when the route changes state. The user probably considers such a route important for some reason. See Table 36-11 in Chapter 36 for more information.

In any case, entries with non-null reference counts are never considered eligible for deletion.

Lookups

As mentioned in the section "Routing Cache," Linux uses both a routing cache and a routing table. Figure 30-8 summarizes the steps in a routing table lookup. To keep the "Route/deliver packet" simple, it does not reflect the variety of routes described earlier in the section "Special Routes."

Lookups in the routing cache are based on an exact match in a simple hash table. Lookups in the potentially much bigger and more complex routing table are based on a Longest Prefix Match (LPM) algorithm, described in the following section. As we will see in Chapter 34, a routing table is organized as a complex mesh of data structures. This makes LPM faster and easier to implement, scales well with a large number of routes, and reduces the duplication of instances of data structures that can be shared.

Longest Prefix Match

If there were only one route toward each destination, routing lookups would be trivial. As soon as you found a route whose destination subnet included your destination address—the key of the lookup—you would be done. However, routing is a complex topic. Without going into detail on the network topologies or specific cases where this complexity occurs, suffice it to say that it is not uncommon to have multiple routes to the same destinations. The overlapping between the routes can be anywhere from one address to an entire subnet.

Figure 30-8. Routing lookup

In case of multiple matches, the routing algorithm needs a rule to deterministically decide which of the eligible routes should be selected as the best candidate. Here is where LPM comes into play and partially solves the problem: the best route is the most specific one. This means the one with the smallest subnet size, or equivalently, the longest netmask.
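The selection criterion is easy to express in a few lines of Python with the standard ipaddress module. This sketch has of course nothing in common with the kernel's implementation, which Chapter 34 covers; it only shows which route wins.

```python
# Minimal Longest Prefix Match: among all routes whose destination
# subnet contains the address, pick the one with the longest netmask.
import ipaddress

routes = [
    (ipaddress.ip_network("10.0.0.0/16"), "10.0.1.1", "eth0"),
    (ipaddress.ip_network("10.0.0.0/24"), "10.0.0.1", "eth1"),
]

def lpm_lookup(dst):
    dst = ipaddress.ip_address(dst)
    matches = [r for r in routes if dst in r[0]]
    if not matches:
        return None
    return max(matches, key=lambda r: r[0].prefixlen)   # most specific wins

print(lpm_lookup("10.0.0.100"))   # both routes match; the /24 wins (via eth1)
print(lpm_lookup("10.0.5.1"))     # only the /16 matches (via eth0)
```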

Let's look at an example. Suppose our routing table had two routes that matched the lookup for the destination address 10.0.0.100, as shown in Table 30-6.

Table 30-6. Routing table example 1

Destination     Next hop     Device to use
10.0.0.0/16     10.0.1.1     eth0
10.0.0.0/24     10.0.0.1     eth1

Since the second route has 24 bits (out of 32) in common with the destination address and the first one has only 16, the second one is said to have the longest prefix match and wins. It is not uncommon to define routes like the ones in Table 30-6, where one route leads to a subset of addresses of another one. This can be necessary, for instance, to route traffic addressed to a specific subnet differently from the rest of the network, due to administrative or security reasons. But it is also the easiest way to configure routing: the alternative would be to split the 10.0.0.0/16 range into multiple /24 ranges (i.e., from 10.0.0.0/24 to 10.0.255.0/24) and therefore put 256 routes into the routing table, 255 of them with the same next hop. This would make routing lookups slower and more CPU expensive.

However, LPM alone is not sufficient to deterministically select one route when multiple ones match a given destination. Let's take the example in Table 30-7.

Table 30-7. Routing table example 2

Destination     Next hop     Device to use
10.0.0.0/16     10.0.1.1     eth0
10.0.0.0/24     10.0.0.1     eth1
10.0.0.0/24     10.0.0.2     eth1

In this case, there are two routes with the same matching prefix length.

We will see in Chapter 35 that lookups include the Type of Service (TOS) in the search key: this means that when configured, the TOS can be used as a tie breaker.

When the TOS is not sufficient to select a route, the route with higher priority (lower priority value) is selected.

If the priority is also not sufficient to unequivocally choose one route, the kernel will simply choose the first one. This means that it matters in which order routes to the same destination and with the same prefix length are added to the routing table.
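Putting the tie-breakers together, the full selection order can be sketched as follows; the route layout and the TOS matching are deliberately simplified assumptions, not the kernel's logic (Chapter 35 covers the real lookup).

```python
# Illustrative selection order: longest prefix, then matching TOS,
# then lowest priority value, then insertion order.
import ipaddress

def select_route(routes, dst, tos=0):
    """routes: list of (network, tos, priority, next_hop) in insertion order."""
    dst = ipaddress.ip_address(dst)
    matches = [(i, r) for i, r in enumerate(routes) if dst in r[0]]
    if not matches:
        return None
    best = max(r[0].prefixlen for _, r in matches)
    matches = [(i, r) for i, r in matches if r[0].prefixlen == best]
    tos_matches = [(i, r) for i, r in matches if r[1] == tos]
    if tos_matches:                      # TOS used as a tie breaker
        matches = tos_matches
    # lowest priority value wins; remaining ties go to the first route added
    return min(matches, key=lambda m: (m[1][2], m[0]))[1]

routes = [
    (ipaddress.ip_network("10.0.0.0/16"), 0, 0, "10.0.1.1"),
    (ipaddress.ip_network("10.0.0.0/24"), 0, 0, "10.0.0.1"),
    (ipaddress.ip_network("10.0.0.0/24"), 0, 0, "10.0.0.2"),
]
print(select_route(routes, "10.0.0.100")[3])   # 10.0.0.1: first /24 added wins
```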

Packet Reception Versus Packet Transmission

The routing table is used to route both packets that are transmitted and those that are received, because either type may be delivered locally or forwarded. But besides those obvious uses of the routing table, there are others that are less obvious.

Figure 30-9 shows a couple of examples of routing table use. It distinguishes between lookups triggered by the reception of data (left side) and the transmission of data (right side). Note that the routing information required to forward an input packet is collected when the packet is first received, which explains why the Forwarding block does not have an arrow toward the Routing block in Figure 30-9. The figure also includes pointers to those chapters where you can find more details about a given kernel component. Here are some details concerning the activities shown:

  • Address Resolution Protocol (ARP) packets are not routed, but ARP may need to do a route lookup to enforce some sanity checks. See Chapter 28.

  • IP-over-IP is a simple tunneling protocol that encapsulates IP packets within larger IP packets. When the IP handler is handed an ingress IP-over-IP packet, it redelivers the payload to the IP layer. The inner IP packet is routed like any other ingress packet, so the routing subsystem needs to make another routing lookup.

    Figure 30-9. Ingress and egress traffic routing

  • Routing a packet normally requires only one lookup, regardless of where the packet originated (locally or remotely). That lookup returns all the information needed to route the packet, including the kernel functions that will take care of it. There are a few exceptions: you may have a feature that for some reason needs to make additional lookups, as with the case of IP-over-IP just described.

  • The routing core first checks whether the cache already contains the required information, and falls back to the routing table otherwise.




[*] Unlike IPv4, IPv6 explicitly defines the router role by using a special flag in the IP header.

[*] The section "Scope" describes the exact meaning of the term when applied to IP addresses.

[*] There are exceptions, of course. See the section "Directed Broadcasts" for an example.

[*] There are topologies where a default gateway is not needed. See the section "Proxy ARP Server as Router" in Chapter 28 for an example.

[*] We will see in Chapter 36 that you can configure these alternative route types only using the new-generation configuration tool IPROUTE2.

Chapter 31. Routing: Advanced

The previous chapter gave an introduction to basic routing. This chapter introduces routing features such as policy routing and multipath that can be used to configure routing in more complicated scenarios. It also shows how routing interacts with the Traffic Control subsystem in charge of QoS, and the firewall code (Netfilter). The chapter concludes with two smaller features: ICMP redirects and reverse path filtering.

Concepts Behind Policy Routing

We saw in the section "Special Routes" in Chapter 30 that the Linux kernel uses two routing tables by default, one for local routes and one configurable by the administrator. When the kernel is compiled with support for policy routing, you can have up to 255 distinct and independent routing tables. In this chapter, we will see what policy routing can be used for, and in Chapter 35 we will see its implications on the design of the routing subsystem.

The main idea behind policy routing is to allow the user to configure routing based on more parameters than just the destination IP addresses.

The Internet thrived for years with most routers configured just to route packets based on the destination IP address. (For the sake of simplicity, I'll leave out particular factors such as crossing ISP or country boundaries.) And basing the route on only the destination address can (with the help of some external configuration parameters) lead to pretty optimal routing tables for a surprisingly wide range of situations.

But the commercial world needs to take many other things into account, such as separating streams of traffic for security or accounting purposes, or sending real-time streaming traffic over a separate route. Here is where policy routing comes into play. Because there are such varied criteria for routing, for the purposes of this chapter I'll just say that any routing based on more than just the destination address is policy routing.

An example of the use of policy routing is for an ISP to route traffic based on the originating customer, or on Quality of Service (QoS) requirements. The customer can often be easily identified by the port on which the traffic arrives at the ISP's router, the source IP address, or a combination of the two. A router can also use a combination of source and destination addresses to identify a profile of traffic or an aggregate of traffic from a given source. The QoS requirements can be derived from the DiffServ Code Point (DSCP) field of the IP header and from a combination of the fields of the higher-layer headers (these identify the applications).

Since this book is about kernel internals, we want to know how those policies are passed to the kernel, see how they are embedded in the routing table, and find out how they affect the routing lookups. We will learn all of this, but let's start with an example, using the topology of Figure 31-1 as a reference.

Let's focus on the configuration of the router RT, which is used to connect Campus 1 and Campus 2 to both Campus 3 and the Internet (let's not bother about how the routers manage to translate the nonroutable addresses 10.0.x.x; this is just an example). Let's also suppose we want to enforce the following two policies:

  • Traffic directed to Campus 3 will go through router RT1 when originated from Campus 1, and through RT2 when originated from Campus 2. One reason could be that the administrator of Campus 2 is willing to pay more and therefore is allowed the use of the faster network that connects RT to Campus 3.

  • Traffic directed to the Internet (e.g., any destination except the three campuses) will go through DG1 (default gateway 1) for Campus 1, and through DG2 for Campus 2. This could be needed, perhaps, to enforce security or bandwidth policies.

This example is a simple one with just a few routes and only two policies. Of course, the advantage of providing independent routing tables appears only in much bigger and more complex scenarios. And even this example is incomplete—we are ignoring, for instance, incoming routes from the Internet to the campuses.

There are two conceivable ways to configure routing on router RT, one of which (multiple tables) is the approach used by Linux.

Single table approach

Table 31-1 is a simplified version of the routing table configured on RT to enforce the two policies listed earlier. Note that because Campus 1 is the only network connected to RT's eth0, and Campus 2 is the only one connected to RT's eth1, the routes do not need to specify the source IP addresses. Instead of routing just on the destination address—because the same destination address can match multiple routes—multiple criteria are checked to choose a unique route. In this case, the incoming device is checked along with the destination address.

Figure 31-1. Example of topology that may require policy routing

Table 31-1. Single routing table example

Ingress device   Source IP       Destination IP   Next hop          Egress device

Routes to Campuses 2 and 3 for traffic originated in Campus 1
eth0             Not specified   10.0.3.0/24      10.0.0.10 (RT1)   eth2
eth0             Not specified   10.0.2.0/24      Not specified     eth1

Routes to Campuses 1 and 3 for traffic originated in Campus 2
eth1             Not specified   10.0.3.0/24      10.0.0.20 (RT2)   eth2
eth1             Not specified   10.0.1.0/24      Not specified     eth0

Default routes for Campus 1 and Campus 2
eth0             Not specified   0.0.0.0/0        10.0.0.11 (DG1)   eth2
eth1             Not specified   0.0.0.0/0        10.0.0.21 (DG2)   eth2

Multiple table approach

Because so many criteria could potentially be involved in every route lookup, it's faster and easier for a host to maintain independent routing tables and choose the right one from particular criteria. For instance, the source IP address or the ingress device could be used to choose a routing table, and that table could contain more criteria to help make the final route selection.

Thus, when multiple routing tables are used, the kernel has to select the right routing table before it can do a lookup—the choice of a routing table is where policies come into effect. The routes for router RT in our example could therefore be determined by the following rules, in conjunction with Tables 31-2 and 31-3:

  • Traffic coming in on eth0 is checked against Routing Table 1 (Table 31-2).

  • Traffic coming in on eth1 is checked against Routing Table 2 (Table 31-3).

Table 31-2. RT1 used to route traffic from Campus 1

Destination IP   Next hop          Egress device

10.0.2.0/24      None              eth1
10.0.3.0/24      10.0.0.10 (RT1)   eth2
0.0.0.0/0        10.0.0.11 (DG1)   eth2

Table 31-3. RT2 used to route traffic from Campus 2

Destination IP   Next hop          Egress device

10.0.1.0/24      None              eth0
10.0.3.0/24      10.0.0.20 (RT2)   eth2
0.0.0.0/0        10.0.0.21 (DG2)   eth2

The first entry in Tables 31-2 and 31-3 does not need to be explicitly configured because the kernel derives it from the configuration of interfaces eth0 and eth1, respectively. We will see how this is achieved in Chapter 32.

As we will see in Chapter 33, Linux maintains only one routing cache that is updated by all the routing tables. These tables also share the memory pools used to allocate the building blocks of the tables. Linux does not enforce any fairness mechanism to share these common resources equitably among the various routing tables. In addition to simplifying the implementation, this actually maximizes overall routing throughput, because more system resources are allocated to the routing tables with higher needs. However, it may have an externally detectable effect: that is, when one Linux host using different routing tables manages traffic from different sources, the overall experience from a customer perspective may be different from the experience provided by independent routers or even by a single host that stringently separates the resources used by routes.

Lookup with Policy Routing

When policy routing is in use, a lookup for a destination consists of two steps:

  1. Identify the routing table to use, based on the policies configured. This extra task inevitably increases routing lookup times.

  2. Do a lookup on the selected routing table.

Of course, before taking these two steps, the kernel always tries the routing cache.

Policies can be assigned an administrative type, like routes (see the section "Route Types and Actions" in Chapter 30). This allows the kernel to make a quick decision based on a type assigned to an entire policy, without waiting to look up the route. For example, the kernel generates an ICMP HOST UNREACHABLE message when the matching policy is configured with an UNREACHABLE type, instead of waiting and finding a matching route configured with an UNREACHABLE type.

Figure 31-2 is a revised version of Figure 30-9 in Chapter 30, with added support for policy routing and details about the optional policy types.

Routing Table Selection

The policies that let the kernel select the routing table to use can be based on the following parameters:

Source and/or destination IP address

It is possible to specify both the source IP address and the destination IP address, each with a netmask.

Ingress device

Depending on the context, the receiving device can be a more appropriate criterion for routing policy than the source IP address. There are cases where a packet with one source IP address could arrive on more than one interface, but we would like the configuration to be based on the receiving device—for instance, if traffic on one device was considered real time and higher priority. In that case, the source IP address would not be of much help. The use of the device rather than of the source IP address could be preferable in these cases as well:

Figure 31-2. Policy routing lookup

  • When multiple, discontinuous ranges of source IP addresses are on the same device that we want to associate with the same routing table. In this case, instead of adding a rule for each distinct range of IP addresses, you can simplify the configuration by using a single rule based on the device.

  • When the selection of the routing table has more to do with the physical network topology than with the source of the traffic.

TOS

The use of this parameter can help in classifying the type of traffic (e.g., bulk data, interactive, etc.), as opposed to the parameters based on the source and destination of the traffic.

Fwmark

This is one of the features that shows the power of Linux firewalling. Policy routing rules can be defined in terms of firewall classification. Of course, for this to be possible, the firewall has to classify traffic before routing comes into the picture. See the section "Policy Routing and Firewall-Based Classifier."

Any combination of the preceding parameters also represents a valid way to determine the policy.

Concepts Behind Multipath Routing

Multipath is a feature that allows an administrator to specify multiple next hops for a given route's destination. In environments with substantial requirements, there are several reasons for doing this. A router could just use one ISP most of the time, and switch to the other when the first one fails for some reason. Another application of multipath is to keep a path on standby and enable it only when bandwidth requirements surpass a predefined threshold.

Figure 31-3 shows a topology where the network on the left is connected to the Internet via router RT, which is configured to use two uplinks simultaneously via two different ISPs.

Figure 31-3. Example of topology with multipath

Let's suppose we want to have RT use both RT1 and RT2 as default gateways, keeping them always available. On RT, we could define a multipath route, simply by providing the route with more than one next hop. The following user-space command, using the newer IPROUTE2 package, would enable multipath:

ip route add default scope global nexthop via 100.100.100.1 weight 1 nexthop via 
200.200.200.1 weight 2

Note that even if the route includes multiple next hops, the route is still considered a single route. Therefore, given a route (in our example, the default route 0.0.0.0/0) with more than one next hop, the kernel needs a mechanism to select the next hop to use each time the route matches a route lookup. There are different ways to do that, each one with its pros and cons. For an interesting analysis of the most common algorithms for multipath routing, I suggest you read RFCs 2991 and 2992.

Linux provides flexibility among algorithms by allowing the administrator to assign each next hop a weight with the weight keyword. The number of times a next hop is selected is proportional to its weight in relation to all the other next hops. If all the next hops are assigned the same weight, the algorithm falls back to the so-called equal cost multipath algorithm.

Note, however, that the granularity used to distribute traffic among the next hops is measured not in packets, but in the number of routing cache entries. This is because once a next hop is selected, an entry is added to the cache. Because the routing subsystem always consults the cache before invoking any check on routing tables, subsequent packets belonging to the same traffic flow (aggregate of traffic) will be handled straight from the cache. As explained in Chapter 36, a flow is a collection of packets that match a set of criteria. These consist mainly of the source or destination addresses, the ingress or egress devices, and the IP TOS field. You will see in the section "Per-Flow, Per-Connection, and Per-Packet Distribution" that when multipath support for the cache is enabled, traffic can also be distributed on a per-connection basis instead of on a per-flow basis.

From purely a throughput point of view, this granularity may be suboptimal, because different flows may have very different bandwidth requirements, and therefore the kernel may be unfair even when all of the next hops are configured with the same weight—and what is worse, the unfairness would not be deterministic. So Linux provides an option that allows you to use per-packet rather than per-flow granularity (see the section "Equalizer algorithm"). However, in most cases, given the high number of flows that usually traverse a router, the next hops are likely to get, on average, a load that is proportional to their weights.

Next Hop Selection

The selection of the next hop is based on a weighted round-robin algorithm.

We saw in the previous section a sample user-space command that specified a weight for each next hop. Usually, an administrator assigns a weight to each path to indicate whether it is preferred. That is the weight used by the round-robin algorithm. The method used to define the weight is an administrative issue based on criteria such as bandwidth and cost, so I will not go into detail about it.

The easiest way to select the next hops, proportionally to their weights, would be to simply have each one consume its tokens one by one and then restart. For instance, if we had two next hops with weights of 3 and 5, respectively, we could select the first one three times, the second one five times, and then again the first one three times, etc. But the distribution of traffic with this approach could be too bursty.

Therefore, Linux adds a randomness component to the selection of the next hop. Given the weight Wi for the ith next hop, and given the sum W of all the weights, Linux selects a next hop randomly W times, and each next hop is selected a number of times equal to its weight Wi. The randomness introduced is not too accurate, but it is an acceptable approximation, and it falls back to a simple sequential selection (from first to last next hop) when all the next hops are assigned the same weight 1.

Here is how it is implemented. The kernel defines the round-robin budget as the sum of all the next hop weights. The budget (number of tokens) of each next hop is initialized to the value of its weight. At each round, the kernel generates a random value ranging from 0 to the total round-robin budget. Then it browses the next-hop list until it finds one with a budget greater than or equal to the generated random value. After each next-hop selection, it decrements both the round-robin budget and the selected next hop's budget.

Note that it is possible for none of the next hops to match on the first round. Imagine a case with three next hops whose weights are 1, 2, and 3. The total budget would be 6. Valid random values are the ones in the range 0 to 5. However, values 4 and 5 would not select any next hop because none has a budget that big. When this happens, the kernel subtracts the weight of each nonmatching next hop from the total budget and checks again.

Let's continue our example to show how this works. Suppose our random number was 5. We start browsing the list of next hops. The first one has a budget of 1, which is not sufficient. We therefore do not select it, and reduce our requirement from the following next hop by lowering the random value to 5-1, or 4. The following next hop has a budget of 2, which again is not sufficient. So we lower the random value again to 4-2, or 2. The last next hop has a budget of 3, which is greater than or equal to 2 and therefore is selected. This, by the way, is the worst case in terms of performance: the last of the next hops is the one selected.

Next hops of a multipath route can be temporarily unavailable (see the section "Effects of Multipath on Next Hop Selection" in Chapter 35). These, of course, will not be taken into consideration by the next-hop selection algorithm.

Cache Support for Multipath

By default, the routing cache does not support multipath. Therefore, as we saw in the section "Concepts Behind Multipath Routing," once the algorithm in the section "Next Hop Selection" has chosen one of the next hops, it will be used for all subsequent traffic matching the same lookup key because a route is added to the cache with a reference to that next hop.

Starting with version 2.6.12, the Linux kernel comes with an option that allows the user to enable multipath support for the cache, and also allows the system administrator to select what algorithm to use to distribute traffic between the different next hops specified by a given multipath route.

Here are the available algorithms:

Random

The next hop to use is selected randomly. This is fast because it does not require any state information to be kept. On average, it distributes traffic equally on all next hops.

Weighted random

Next hops are assigned a weight, and traffic is distributed randomly to all next hops proportionally to their weights.

Round robin

Standard round-robin algorithm, distributing each transmission to the next route in order.

Device round robin

Instead of distributing traffic based on the routes, traffic is distributed in round-robin fashion on the interfaces. Multiple next hops sharing a common device are considered one unit.

When you configure a route with IPROUTE2's ip route command, you can use the new mpath keyword to select the algorithm to use. This is an example of a route configured to use the round-robin algorithm:

ip route add 10.0.1.0/24 mpath rr nexthop via 192.168.1.1 weight 1
                                  nexthop via 192.168.1.2 weight 2

When the mpath keyword is not provided, multipath caching is kept disabled on the route.

The weighted random and device round-robin algorithms are described in more detail in the next subsections.

Weighted random algorithm

Assume we have a multipath route with four next hops, assigned the weights 1, 1, 2, and 4. Let's align the four weights along a line as shown in Figure 31-4. The sum of the weights is 8, so if you generate a random number in the range 0-8 you can unequivocally identify a next hop in the line. For example, the value 2.8 would select the third next hop. It should be clear that the next hops are selected proportionally to their weights.

Figure 31-4. Example of weighted random selection

Device round-robin algorithm

The next hops of a multipath route can be reachable through a single device, each can be reachable through a different device, or you can have a hybrid situation. These three cases are shown in Figure 31-5.

A pure round-robin algorithm would distribute traffic equally to the various next hops, but not necessarily equally to the various devices associated with those next hops. For example, a multipath route with three next hops, two of which share the same egress device, as in Figure 31-5(c), would load one of the two egress devices twice as much as the other.

Thus, the goal of the device round-robin algorithm is to distribute traffic equally among a pool of devices, instead of on a per-multipath-route basis. All traffic that matches any route configured to use this algorithm is considered a single aggregate of traffic to distribute equally among devices. Therefore, the decision concerning which device to use for a given multipath route depends not only on the devices previously used to route traffic with the same multipath route, but also on the devices used by other multipath routes.

Note that while a pure round-robin algorithm assumes that the bottleneck in the forwarding path is the target routers' CPUs, device round robin aims at optimizing the use of the devices' bandwidths, giving less importance to the target CPUs.

Figure 31-5. Different ways to assign next hops to interfaces

Per-Flow, Per-Connection, and Per-Packet Distribution

Given a multipath route, traffic matching the route could be distributed between the next hops on a per-flow, per-connection, or per-packet basis:

Per flow

The next hop to be used is selected for each unique combination of source and destination IP addresses. Therefore, multiple connections between the same pair of hosts would require only one selection.

Per connection

The next hop to be used is selected every time a new connection is started. This means that multiple connections between the same pair of hosts can be distributed over multiple next hops. A connection is typically identified by the 5-tuple of source IP, destination IP, L4 protocol, source L4 port, and destination L4 port.

Per packet

The next hop to be used is selected for each packet. Packets that belong to the same connection can be spread over multiple next hops.

Per-connection and per-flow distribution are needed for connection-oriented protocols such as TCP to work correctly, but per-packet distribution could work well with connectionless protocols such as the User Datagram Protocol (UDP).

When there is no support for multipath caching, Linux always distributes traffic between the different next hops of a multipath route on a per-flow basis, proportionally to the next hops' weights, as seen in the section "Concepts Behind Multipath Routing."

When multipath caching is enabled, traffic is distributed differently depending on where it originates:

Locally generated traffic

Traffic is distributed on a per-connection basis using one of the algorithms listed in the section "Cache Support for Multipath."

Ingress traffic that needs to be forwarded

Traffic is distributed as if there was no support for multipath caching: the first matching cached route is always used. This is necessary to reduce the likelihood that IP packets of any given connection will reach the destination host out of order.

Equalizer algorithm

The Linux kernel at times has offered per-packet distribution, called equalization. Here is an example of a command (not implemented in the current Linux kernel) that asks for an equalized route through the eql option:

# ip route add eql 100.100.100.0/24 nexthop via 10.0.0.2 nexthop via 10.0.0.3
# ip route list
100.100.100.0/24 equalize
    nexthop via 10.0.0.2  dev eth0 weight 1
    nexthop via 10.0.0.3  dev eth0 weight 1
...

Given how much time has passed since support for this option was announced as being in the works, it is not likely to be added anytime soon, probably because there is no need for it.

Interactions with Other Kernel Subsystems

Between the time a packet makes its appearance in the system, because it was either received on one interface or generated locally, and the time it is delivered to the next hop (if forwarded) or locally (if addressed to the local host), several network subsystems may place their hands on it. Among them are the Firewall and Traffic Control subsystems. Both of them can classify traffic based on various databases of information and store the result of their classification into a field of the buffer descriptor. The routing subsystem code can also classify traffic and store the result in the buffer descriptor.

Figure 31-6 is a simplified overview of how routing, Firewall, and Traffic Control interact, and when a given subsystem comes into the picture. The figure shows how an input packet goes through the various subsystems and gets its firewall and routing tags initialized.

In the next subsections, we will take a closer look at how policy routing and Firewall compute their tags and make them available to other kernel subsystems for use.

Routing Table Based Classifier

Among the many classifiers available to the Traffic Control subsystem is one called the routing table based classifier that can classify routes based on realms. Realms are numerical tags that can be assigned to both policies and routes. Each route and policy can be assigned up to two realms: an ingress realm and an egress realm.

In the following subsections, we will first see how realms are configured to get more familiar with the feature. Then I'll describe the logic used by the routing code to derive the classification tag (which will be used by Traffic Control) from the realms' configuration.

The file ip-cref.ps included with the IPROUTE2 package offers some examples of the purpose and use of realms. In this book, we will consider only how realms are configured via IPROUTE2 commands.

Both routes and policies are configured with the ip command. The first keyword that follows ip determines the object type you want to configure. The route keyword denotes a route, and the rule keyword denotes a policy.

Figure 31-6. Interactions among routing, Traffic Control, and Firewall (Netfilter)

Configuring policy realms

Routing policies are configured with the IPROUTE2's ip rule command. Its syntax is:

ip rule add ... realms [source_realm/]destination_realm

As you can see, the source realm is optional while the destination realm is not, which means that if you provide only one value, you configure the destination realm.

Here are a couple of examples of commands that configure policy realms. The following associates the policy destination realm 128 with all traffic that originates in the subnet 10.0.1.0/24:

ip rule add from 10.0.1.0/24 realms 128

The following command associates the policy source realm 64 and the policy destination realm 128 with traffic that originates in the 10.0.1.0/24 subnet and that is addressed to the 10.0.2.0/24 subnet:

ip rule add from 10.0.1.0/24 to 10.0.2.0/24 realms 64/128

Configuring route realms

A route's realms are configured very similarly to policy realms. The syntax of the IPROUTE2 command for this purpose is:

ip route add ... realms [source_realm/]destination_realm

Note that even though the command's help message does not show both the source and destination realms, the syntax is just the same as for policies.

Here is an example command for traffic directed to the 10.0.1.0/24 subnet; it forwards the traffic to the gateway with address 10.0.0.3 and assigns it to the destination realm 100:

ip route add 10.0.1.0/24 via 10.0.0.3 realms 100

In the following command, traffic directed to the 10.0.1.0/24 subnet is forwarded to the gateway with address 10.0.0.3 and assigned to the source realm 100 and the destination realm 200:

ip route add 10.0.1.0/24 via 10.0.0.3 realms 100/200

Computing the routing tag

Because realms can be assigned to both individual routes and whole policies, a routing decision can come up with two realms for a single direction: for instance, an ingress destination realm derived from the policy and another ingress destination realm derived from the route. In such a case, the realm derived from the route is given higher priority. Usually, such a decision is necessary only for a destination realm; administrators rarely define source realms on the basis of the route.

If a realm is missing—not provided by either route or policy—the kernel computes the reverse route (from the local host back to the source of the packet being classified) and checks whether it can use the associated realms. For instance, if the kernel cannot derive a source realm for ingress traffic, it figures out and uses the destination realm for egress traffic on the reverse path. This heuristic assumes that the realm configurations on the two directions should be symmetric.

Let's look at a simple example using the topology in Figure 31-7, which shows a router between two networks. The policy routing configuration says that traffic coming from subnet 10.0.1.0/24 belongs to Realm A, and that traffic coming from subnet 10.0.2.0/24 belongs to Realm B. Assume that no route realm is configured; only the two policy realm configurations shown in Figure 31-7. Both of those policies provide only the source realm—so when forwarding, a realm is specified for ingress but not for egress. Let's suppose now that the router receives a packet from host 10.0.1.100 (Realm A) directed to the destination address 10.0.2.200 (Realm B). When the routing subsystem makes a lookup to route the packet, it also computes the routing tag. The following list explains what happens.

  1. The routing lookup returns route R2 and policy P1. Because no realm is configured on route R2, the source realm A from policy P1 is used.

  2. Because the destination realm is not initialized, the kernel computes the reverse route from 10.0.2.200 to 10.0.1.100. The routing lookup this time returns route RT1 and policy P2. Once again, no realm is configured on the matching route RT1, so the kernel relies on the policy realm, which is B. However, because this was found during a reverse lookup, the source realm B on the reverse path is used as a destination realm on the forward path.

最后,路由标记被初始化为源领域A和目标领域B。当稍后遍历QoS层时,它可以使用这两个领域对数据包进行正确分类。

In the end, the routing tag is initialized to source realm A and destination realm B. When the QoS layer is traversed later, it can use those two realms to correctly classify the packet.
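This reverse-route heuristic can be sketched in a few lines of Python. The sketch below is a simplified model, not kernel code: `policy_realms` is a hypothetical stand-in for policies P1 and P2 of Figure 31-7 (each maps a source subnet to a realm), and a missing destination realm is filled in from the source realm of the reverse route.

```python
import ipaddress

# Hypothetical stand-in for the two policies in Figure 31-7:
# each maps a source subnet to a realm.
policy_realms = {
    ipaddress.IPv4Network("10.0.1.0/24"): "A",
    ipaddress.IPv4Network("10.0.2.0/24"): "B",
}

def source_realm(addr):
    """Return the realm of the policy matching this source address, if any."""
    ip = ipaddress.IPv4Address(addr)
    for subnet, realm in policy_realms.items():
        if ip in subnet:
            return realm
    return None

def routing_tag(src, dst):
    """Compute (source realm, destination realm) for a forwarded packet.

    No route realms are configured, so the destination realm is missing and
    is taken from the source realm of the reverse route (dst -> src).
    """
    s = source_realm(src)   # forward lookup: e.g., policy P1 gives realm A
    d = source_realm(dst)   # reverse lookup: e.g., policy P2 gives realm B
    return (s, d)
```

For the packet in the example, `routing_tag("10.0.1.100", "10.0.2.200")` yields `("A", "B")`, matching the tag the text derives.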

领域配置示例

图 31-7。领域配置示例

Figure 31-7. Example of realm configuration

图 31-8总结了用于计算标签的逻辑。

Figure 31-8 summarizes the logic used to compute the tag.

策略路由和基于防火墙的分类器

Policy Routing and Firewall-Based Classifier

Netfilter 防火墙软件可以对流量进行分类，以根据其过滤标准判断是否需要丢弃或修改（mangle）数据包。防火墙还可以配置为仅使用其强大的分类引擎对数据包进行分类，以便为其他内核子系统提供服务。防火墙在网络堆栈中有多个钩子。如果路由或流量控制在放置标记的某个钩子之后运行，这些子系统就可以看到该标记并对其进行操作。本章前面的图 31-6 显示了各个子系统和防火墙钩子访问数据包的顺序。

The Netfilter firewall software can classify traffic to see whether, based on its filtering criteria, it needs to drop or mangle packets. The firewall can also be configured to simply classify a packet using its powerful classification engine just to provide a service to other kernel subsystems. The firewall has multiple hooks in the network stack. If Routing or Traffic Control runs after one of the hooks that places a tag, those subsystems can see and act on the tag. Figure 31-6, earlier in the chapter, showed the sequence in which various subsystems and firewall hooks access packets.
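As a rough illustration of how a firewall tag can steer routing, the sketch below selects a routing table from an ordered list of policy rules keyed on the fwmark. This is plain Python with made-up rule and table names, not the netfilter or iproute2 API; it only models the idea of mark-keyed rule matching.

```python
# Each rule is (fwmark or None, routing table name); None matches any mark.
# Rule order matters, just as `ip rule` priorities do.
POLICY_RULES = [
    (0x1, "table_vpn"),   # packets the firewall marked with 0x1
    (0x2, "table_dsl"),   # packets the firewall marked with 0x2
    (None, "main"),       # default rule: matches everything else
]

def select_table(fwmark, rules=POLICY_RULES):
    """Return the routing table of the first rule matching the packet's mark."""
    for mark, table in rules:
        if mark is None or mark == fwmark:
            return table
    raise LookupError("no matching policy rule")
```

A packet marked `0x1` would be looked up in `table_vpn`, while unmarked traffic falls through to `main`.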

用于计算路由标记的逻辑

图 31-8。用于计算路由标记的逻辑

Figure 31-8. Logic used to compute the routing tag

路由协议守护进程

Routing Protocol Daemons

可以从三个主要来源将路由插入到内核路由表中:

Routes can be inserted into the kernel routing tables from three main sources:

  • 通过用户命令进行静态配置（例如 ip route、route）

  • Static configuration via user commands (e.g., ip route, route)

  • 通过边界网关协议 (BGP)、外部网关协议 (EGP) 和开放最短路径优先 (OSPF) 等路由协议进行动态配置,作为用户空间路由守护进程实现

  • Dynamic configuration via routing protocols such as Border Gateway Protocol (BGP), Exterior Gateway Protocol (EGP), and Open Shortest Path First (OSPF), implemented as user-space routing daemons

  • 由于配置不理想而由内核接收和处理的 ICMP 重定向消息

  • ICMP redirect messages received and processed by the kernel due to suboptimal configurations

我们将在第 36 章中介绍第一个来源,我们将在本章后面看到第三个来源。现在让我们看一下第二个来源,特别是可用于 Linux 系统的路由守护程序。它们与内核交互的细节将在第 32 章中介绍。以下是不再维护但仍然有趣的项目列表:

We will cover the first source in Chapter 36, and we will see the third one later in this chapter. Let's take a look at the second source now, and in particular the routing daemons available for Linux systems. The details of their interaction with the kernel will be covered in Chapter 32. Here is a list of projects that are no longer maintained but are nevertheless interesting:

Routed
Routed

最古老的 Unix 路由协议守护进程。它仅包括 RIP 协议,版本 1 和 2(请参阅 RFC 2453)。

The oldest Unix routing protocol daemon. It includes only the RIP protocol, both Versions 1 and 2 (see RFC 2453).

GateD ( http://www.gated.org )
GateD (http://www.gated.org)

包括大多数路由协议。它最初是 Merit GateD 联盟的一个研究项目,但其权利后来被 NextHop 收购。研究版本不再维护。

Includes most of the routing protocols. It started as a research project by the Merit GateD Consortium, but its rights were later acquired by NextHop. The research version is no longer maintained.

BIRD ( http://bird.network.cz )
BIRD (http://bird.network.cz)

一个由布拉格查理大学（Charles University）启动的项目。它支持最常见的路由协议。

A project started at the Charles University in Prague. It supports the most common routing protocols.

以下是仍在维护和部署的路由协议套件列表:

The following is a list of routing protocol suites that are still maintained and deployed:

Zebra ( http://www.zebra.org )
Zebra (http://www.zebra.org)

包括大多数路由协议。它已经被广泛部署,并且其邮件列表也被积极使用。然而,发布周期变得有点慢,导致了 Quagga 的诞生。

Includes most of the routing protocols. It is already widely deployed and its mailing lists are actively used. However, the release cycle has become a little slow, leading to the birth of Quagga.

Quagga ( http://www.quagga.net )
Quagga (http://www.quagga.net)

Zebra 的一个分支,创建于 2003 年,旨在为用户社区提供更快的开发周期、更快的错误修复和更多文档。

A fork of Zebra that was created in 2003 to provide the user community with a faster development cycle, faster bug fixing, and more documentation.

XORP ( http://www.xorp.org )
XORP (http://www.xorp.org)

一个由加州伯克利的国际计算机科学研究所启动的新项目。

A new project started at the International Computer Science Institute in Berkeley, California.

请参阅括号内的 URL 来准确查找每个包提供的协议和扩展。

Refer to the URLs within the parentheses to find exactly what protocols and extensions each package provides.

路由守护进程的实现没有在本书中介绍,因为它们不属于内核,但我们在这里简要地了解一下它们如何与内核对话。例如,了解守护进程如何将从对等方或用户配置中获悉的路由注入到路由表中,以及如何删除失效路由等,这一点非常重要。

The routing daemon implementations are not covered in this book because they do not belong to the kernel, but we briefly look here at how they talk to the kernel. It is important to know, for instance, how the daemons inject into the routing tables the routes that they learn from their peers or from user configuration, and how they remove defunct routes.

每个守护进程在用户空间中维护自己的路由表。这些表不直接用于选择任何路由——只有内核内存中的内核路由表才用于此目的。然而，正如本节前面提到的，守护进程是用于填充内核表的来源之一。前面介绍的大多数守护进程都实现了多种路由协议。每个路由协议在运行时都保留自己的路由表。根据守护进程的设计，每个协议可能自行将路由安装到内核的路由表中（如图 31-9 左侧所示），或者各协议可能共享守护进程内负责与内核通信的一个公共层（如图 31-9 右侧所示）。采用哪种方法是用户空间的设计选择，超出了本书的范围。

Each daemon maintains its own routing tables in user space. These are not used to select any routes directly—only the kernel's routing tables in kernel memory are used for that. However, the daemons are one of the sources used to populate the kernel tables, as mentioned earlier in this section. Most of the daemons introduced earlier implement multiple routing protocols. Each routing protocol, when running, keeps its own routing table. Depending on the design of the daemon, each protocol might install routes into the kernel's routing table on its own (as shown on the left side of Figure 31-9), or the protocols may share a common layer within the daemon that does the talking to the kernel (as shown on the right side of Figure 31-9). The approach used is a user-space design choice outside the scope of this book.

路由协议和内核之间的通信是双向的:

Communication between routing protocols and the kernel is bidirectional:

  • 路由协议将路由安装到内核的路由表中,并删除它们确定过期或不再有效的路由。

  • The routing protocols install routes into the kernel's routing table and remove routes they have determined to be expired or no longer valid.

  • 内核通知路由协议有关新路由的安装或删除，以及本地设备链路状态的更改（这当然会间接影响所有关联的路由）。仅当路由守护进程通过 Netlink 套接字（即一个双向通道）与内核对话时，这才可能实现。

  • The kernel notifies routing protocols about the installation or removal of new routes, and about a change of state in a local device link (which of course indirectly affects all the associated routes). This is possible only when the routing daemons talk to the kernel via a Netlink socket; that is, a bidirectional channel.

IPROUTE2 软件包不仅允许用户配置路由,还允许用户监听由内核和路由守护进程生成的上述通知。因此,管理员可以记录它们或将它们转储到屏幕上以进行调试。

The IPROUTE2 package allows the user not only to configure routes, but also to listen to the aforementioned notifications generated by the kernel and by routing daemons. Thus, an administrator can log them or dump them on the screen for debugging purposes.
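To give a feel for the Netlink channel described above, here is a sketch that builds the bytes of an RTM_GETROUTE dump request — the kind of message a tool such as `ip route` sends to read the kernel tables. The constants come from `<linux/netlink.h>` and `<linux/rtnetlink.h>`; actually sending the message requires an AF_NETLINK socket on a Linux host, which is omitted here so the message construction itself can be checked.

```python
import struct

# Constants from <linux/netlink.h> and <linux/rtnetlink.h>
NLM_F_REQUEST = 0x1
NLM_F_DUMP = 0x300          # NLM_F_ROOT | NLM_F_MATCH
RTM_GETROUTE = 26
AF_INET = 2

def build_route_dump_request(seq=1):
    """Return the bytes of an nlmsghdr + rtmsg asking for a full route dump."""
    # struct rtmsg: family, dst_len, src_len, tos, table, protocol,
    # scope, type (8 bytes), then rtm_flags (u32); an all-zero filter.
    rtmsg = struct.pack("=8BI", AF_INET, 0, 0, 0, 0, 0, 0, 0, 0)
    # struct nlmsghdr: total length, message type, flags, sequence, pid
    hdr = struct.pack("=IHHII", 16 + len(rtmsg), RTM_GETROUTE,
                      NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return hdr + rtmsg
```

The reply would be a stream of RTM_NEWROUTE messages, one per route — the same message type the kernel uses for the asynchronous notifications mentioned above.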

用户空间和内核空间之间的接口

图 31-9。用户空间和内核空间之间的接口

Figure 31-9. Interface between user space and kernel

详细监控

Verbose Monitoring

当内核中添加对此选项的支持并启用该选项(默认情况下禁用)时,当输入数据包具有可疑或无效的源或目标 IP 地址时,内核会在控制台上打印警告消息。这些消息的速率限制为每五秒一次,以避免潜在的 DoS 攻击。

When support for this option is added to the kernel and the option is enabled (it is disabled by default), the kernel prints warning messages on the console when input packets have suspicious or invalid source or destination IP addresses. These messages are rate limited to one every five seconds, to avoid potential DoS attacks.
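The one-message-per-five-seconds limit can be modeled with a tiny token scheme. The class below is illustrative only — the kernel's implementation (jiffies-based) differs in detail — and the clock is injectable so the behavior can be exercised without sleeping.

```python
import time

class LogRateLimiter:
    """Allow at most one message per `interval` seconds (the text's default: 5)."""

    def __init__(self, interval=5.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self._next_ok = float("-inf")   # next instant a message may pass

    def allow(self):
        now = self.clock()
        if now >= self._next_ok:
            self._next_ok = now + self.interval
            return True    # print the warning
        return False       # suppress it
```

Messages arriving in a burst are dropped until the five-second window reopens, which is what blunts a log-flooding DoS attempt.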

由于源地址或目标地址错误，路由子系统中的健全性检查丢弃的入口数据包会触发警告消息。内核可以使用第 30 章的表 30-1 和表 30-3 中列出的分类轻松地进行其中一些检查。概括来说，这些分类是：

Ingress packets that are dropped by sanity checks in the routing subsystem, due to faulty source or destination addresses, trigger a warning message. The kernel can make some of these checks easily using the classifications listed in Table 30-1 and Table 30-3 in Chapter 30. In summary, these classifications are:

  • 源地址:多播、环回、保留、无效 (zeronet)

  • Source address: Multicast, Loopback, Reserved, Invalid (zeronet)

  • 目标地址:环回、保留、无效 (zeronet)

  • Destination address: Loopback, Reserved, Invalid (zeronet)

内核根据路由表对入口数据包进行额外的健全性检查。尤其:

The kernel makes additional sanity checks on ingress packets based on the routing table. In particular:

  • 启用反向路径过滤(反 IP 欺骗检查)时,必须可以通过接收数据包的同一接口访问源 IP 地址。请参阅“反向路径过滤”部分。

  • When reverse path filtering is enabled (an anti-IP-spoofing check), the source IP address must be reachable through the same interface from which the packet was received. See the section "Reverse Path Filtering."

  • 源 IP 地址不能是子网广播地址或接收接口上配置的地址之一。此检查可以帮助防止 IP 欺骗尝试(即,另一台主机声称与接收接口具有相同的 IP 地址),并且还可以检测地址重复的情况,例如可能由 DHCP 错误配置引起的地址重复情况。

  • The source IP address cannot be a subnet broadcast address or one of the addresses configured on the receiving interface. This check can help prevent IP spoofing attempts (i.e., another host claiming the same IP address as the receiving interface), and can also detect cases of address duplication such as might be caused by DHCP misconfiguration.
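The address classes named in the first set of checks can be illustrated with Python's `ipaddress` module. This is a simplified stand-in for the kernel's classification (Tables 30-1 and 30-3), not the actual fib lookup, and it covers only the address-class tests, not the routing-table-based ones.

```python
import ipaddress

def martian_source_reason(addr):
    """Return why a source address is suspicious, or None if it looks sane."""
    ip = ipaddress.IPv4Address(addr)
    if ip.is_multicast:            # 224.0.0.0/4
        return "multicast"
    if ip.is_loopback:             # 127.0.0.0/8
        return "loopback"
    if int(ip) >> 24 == 0:         # 0.x.x.x: the "zeronet"
        return "zeronet"
    if ip.is_reserved:             # 240.0.0.0/4
        return "reserved"
    return None
```

A router with Verbose Monitoring enabled would log (rate limited) any ingress packet for which a check like this returns a non-`None` reason.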

启用详细监控功能后,ICMP 层还可以在特定条件下生成警告消息:

When the Verbose Monitoring feature is enabled, the ICMP layer can also generate warning messages under specific conditions:

ICMP 重定向消息的传输
Transmission of ICMP REDIRECT messages

当内核已将一定数量的 ICMP REDIRECT 消息发送到似乎忽略它们的远程主机时,内核会打印一条警告。精确的数字是可配置的。请参阅“传输 ICMP_REDIRECT 消息”部分。

When the kernel has sent a certain number of ICMP REDIRECT messages to a remote host that appears to ignore them, the kernel prints a warning. The precise number is configurable. See the section "Transmitting ICMP_REDIRECT Messages."

接收 ICMP REDIRECT 消息
Reception of ICMP REDIRECT message

每当入口 ICMP 重定向被拒绝时,内核都会打印一条警告。入口 ICMP REDIRECT 消息的处理比其传输稍微复杂一些,因为内核可能会出于多种原因拒绝入口 ICMP REDIRECT 消息,其中一些原因可由用户配置。请参阅“处理入口 ICMP_REDIRECT 消息”部分。

Whenever an ingress ICMP redirect is rejected, the kernel prints a warning. The processing of ingress ICMP REDIRECT messages is a little more complex than their transmission, because the kernel may reject ingress ICMP REDIRECT messages for several reasons, some of them configurable by the user. See the section "Processing Ingress ICMP_REDIRECT Messages."

ICMP_REDIRECT 消息

ICMP_REDIRECT Messages

ICMP 协议定义了许多不同的消息来控制流量并通知主机网络问题。其中一种消息“REDIRECT” [ * ]用于通知流量源其路由使用情况不佳。有关 ICMP 消息的详细说明,请参阅第 25 章。在本章中,我们重点关注路由子系统提供的调整参数,以决定是否处理入口 ICMP REDIRECT 消息以及在满足默认所需条件时是否传输消息。

The ICMP protocol defines a number of different messages to control traffic flow and notify hosts of network problems. One such message, REDIRECT,[*] is used to notify a source of traffic about its suboptimal use of routing. Refer to Chapter 25 for a detailed description of the ICMP messages. In this chapter, we focus on the tuning parameters provided by the routing subsystem to decide whether to process ingress ICMP REDIRECT messages and whether to transmit one when the default required conditions are met.

关于是否应发送 ICMP REDIRECT 以及是否应处理入口 ICMP REDIRECT 的决定可能会受到用户配置的影响。特别是,用户可以说出是否:

The decision about whether an ICMP REDIRECT should be sent, and whether an ingress ICMP REDIRECT should be processed, can be influenced by user configuration. In particular, the user can say whether:

  • 可以接受或生成入口或出口 ICMP REDIRECT 消息。正如我们将在第 36 章中看到的,这是可以根据每个设备进行配置的。

  • Ingress or egress ICMP REDIRECT messages can be accepted or generated. As we will see in Chapter 36, this is configurable on a per-device basis.

  • 对于入口 ICMP 数据包,还可以指定是否仅接受安全重定向。当消息通告的新网关已被本地主机识别为网关时,ICMP 重定向被认为是安全的。例如,可以通过检查路由表来确定建议的网关是否已用作任何已配置路由的下一跳。

  • For ingress ICMP packets, it is also possible to specify whether to accept only secure redirects. An ICMP REDIRECT is considered secure when the new gateway advertised by the message is already known by the local host as a gateway. This can be determined, for instance, by checking in the routing table whether the suggested gateway is already used as the next hop for any of the configured routes.

  • 每个设备都可以配置一个标志,表明该设备是否连接到共享介质。

  • Each device can be configured with a flag that says whether the device is attached to a shared medium.

最后一个项目符号值得稍微解释一下,这将在下一节中提供。

The last bullet deserves a little explanation, which is provided in the next section.

共享媒体

Shared Media

20 世纪 90 年代初,IP 协议设计者开始关注在以太网等介质上创建 LAN 并将这些 LAN 连接到其他网络的趋势(当时有点新,但现在几乎是普遍的)。有时,管理员会将配置在不同 IP 子网上的主机组连接到单个 LAN,同时使用路由器将它们分开。这样做有很多与历史或方便相关的原因;如今,几乎没有任何理由这样做,也很少找到这样做的理由。尽管如此,路由子系统的设计必须能够处理它。

In the early 1990s, IP protocol designers started looking at the tendency (then somewhat new, but now almost universal) of creating LANs on media such as Ethernet and attaching these LANs to other networks. Sometimes administrators would connect groups of hosts configured on different IP subnets to a single LAN while separating them with routers. There are many reasons related to history or convenience for doing this; nowadays there is rarely any reason to do it and it is rarely found. Nonetheless, the routing subsystem must be designed to handle it.

当配置在不同子网上的主机接入同一 LAN 时，IP 路由文档将其称为共享介质。请注意，该术语指的是网络配置而不是设备的能力。换句话说，在共享 L2 连接的所有主机也共享同一 IP 子网的正常情况下，该术语并不适用，这里讨论的问题也不存在。在本节中，我们关心的问题是：在共享介质上更有可能生成 ICMP REDIRECT 消息。

When hosts configured on different subnets are plugged into the same LAN, IP routing documents call it a shared medium. Note that this term refers to network configuration rather than the device's capabilities. In other words, in the normal case where all the hosts sharing an L2 connection also share an IP subnet, this term does not apply; nor do the issues here. In this section, the issue concerning us is that ICMP REDIRECT messages are more likely to be generated.

图 31-10显示了共享介质的示例。连接到同一 LAN 的主机上配置了三个不同的子网。两个路由器 RT1 和 RT2 用于连接 IP 子网:每个路由器都是两个子网的一部分,每一侧都有一个配置有地址的 NIC。

Figure 31-10 shows an example of a shared medium. Three different subnets are configured on the hosts connected to the same LAN. The two routers RT1 and RT2 are used to connect the IP subnets: each router is part of two subnets, having one NIC configured with an address on each side.

接下来描述图 31-10中配置路由的典型方法(尽管不是唯一的方法)。

A typical way to configure routing in Figure 31-10 (although not the only way) is described next.

共享媒体拓扑的示例配置

图 31-10。共享媒体拓扑的示例配置

Figure 31-10. Sample configuration for a shared media topology

  • 子网 10.0.0.0/24 的主机将 RT1 定义为其默认网关,并配置为通过该网关将所有流量发送到其他子网。其他两个子网和用于访问 Internet 的网关可通过 RT1 访问。

  • The hosts of subnet 10.0.0.0/24 define RT1 as their default gateway and are configured to send all traffic to other subnets through that gateway. The other two subnets and the gateway used to reach the Internet are reachable via RT1.

  • 子网 10.0.1.0/24 的主机具有类似的配置,使用 RT2 作为默认网关。但是,他们需要一条通过 RT1 的额外路由才能到达子网 10.0.0.0/24。

  • The hosts of subnet 10.0.1.0/24 have a similar configuration, using RT2 as their default gateway. However, they need one extra route through RT1 to reach subnet 10.0.0.0/24.

  • 子网 10.0.2.0/24 的主机使用 DG 作为默认网关,并配置两条显式路由通过 RT2 到达另外两个子网。

  • Hosts of subnet 10.0.2.0/24 use DG as their default gateway, and are configured with two explicit routes to reach the other two subnets via RT2.

此路由方案的关键之处在于它指定了低效的路由。例如，子网 10.0.0.0/24 和 10.0.2.0/24 的主机可以在 L2 层直接交换数据包而无需任何路由，但配置却告诉它们使用两台路由器。更聪明的配置是只在主机上配置默认网关，并让默认网关负责通往 LAN 中其他子网的路由。

The key aspect of this routing scenario is that it specifies inefficient routes. The hosts of subnets 10.0.0.0/24 and 10.0.2.0/24, for instance, could exchange packets at the L2 layer without any routing, but the configuration tells them to use two routers. A cleverer configuration would be to configure only the default gateway on the hosts, and have the default gateways take care of the routes to the other subnets in the LAN.

幸运的是,路由子系统有办法自己克服这种低效率,并逐渐找到直接连接的主机。该机制是 ICMP REDIRECT 消息。我们将忽略 DG 和互联网连接的存在。

Luckily, the routing subsystem has ways to overcome this inefficiency on its own and gradually find the directly connected hosts. The mechanism is ICMP REDIRECT messages. We'll ignore the presence of DG and the Internet connection.

假设主机 A 想要与主机 B 通话。根据其路由表,主机 A 发现可以通过路由器 RT1 到达主机 B。然而,当 RT1 收到主机 A 发送并寻址到主机 B 的数据包时,它意识到主机 A 可以将数据包直接发送到主机 B,因为它们(主机 A 和主机 B)都可以通过同一设备 eth0 访问。这看起来像是触发生成 ICMP 重定向的经典条件:次优路由。然而,有一个问题:即使主机 A 和主机 B 连接到相同的共享介质,因此从链路层的角度(即以太网)可以相互通信,但从 IP 层的角度来看,这是不可能的。主机 A 不知道 10.0.1.0/24 子网的主机可通过eth0访问。主机 A 只知道可以通过 RT1 访问 10.0.1.0/24 子网。

Suppose Host A wants to talk to Host B. According to its routing table, Host A sees that Host B is reachable via router RT1. However, when RT1 receives a packet sent by Host A and addressed to Host B, it realizes that Host A could have sent the packet directly to Host B because both of them (Host A and Host B) are reachable via the same device eth0. This looks like the classic condition that triggers the generation of an ICMP REDIRECT: suboptimal routing. However, there's a catch: even if Host A and Host B are connected to the same shared medium and can therefore talk to each other from a link layer point of view (i.e., Ethernet), from the IP layer perspective that's not possible. Host A does not know that the hosts of the 10.0.1.0/24 subnet are reachable via eth0. All Host A knows is that the 10.0.1.0/24 subnet is reachable via RT1.

要了解原因,如有必要,请回顾第 26 章中的“何时传输和处理征求请求”部分。对于一台主机要与另一台基于 L3 地址的主机通信,它必须首先进行 L3 到 L2 地址解析。发送主机只能对属于它所连接的子网之一的主机执行此操作;对于所有其他人来说,找到主机是路由器的事。在我们的例子中,主机 B 不属于主机 A 的子网。这意味着对于告诉主机 A 直接与主机 B 通信的 ICMP 重定向才能工作,主机必须能够接受所谓的外部重定向:建议的新下一跳不属于重定向接收者已知的任何本地子网的重定向。

To understand why, look back, if necessary, to the section "When Solicitation Requests Are Transmitted and Processed" in Chapter 26. For a host to talk to another one based on L3 addresses, it must first make an L3-to-L2 address resolution. A sending host can do that only for hosts that belong to one of the subnets it is connected to; for all others, finding the host is a router's business. In our case, Host B does not belong to Host A's subnet. This means that for an ICMP REDIRECT that tells Host A to talk to Host B directly to work, hosts must be able to accept what is called a foreign redirect: a redirect whose suggested new next hop does not belong to any of the local subnets known to the receiver of the redirect.

外部重定向仅在如图 31-10所示的共享媒体场景中才有意义。这是因为主机 A 收到重定向并接受重定向后,会向主机 B 的地址发送 ARP 请求。[ * ]在图31-10的拓扑中,由于共享介质,主机B将收到ARP请求。但如果主机 B 位于另一个 LAN 中,它将无法接收来自主机 A 的任何 ARP 请求。[ ]

Foreign redirects are meaningful only in shared media scenarios like the one depicted in Figure 31-10. This is because after Host A receives the redirect and accepts it, it sends an ARP request for Host B's address.[*] In the topology of Figure 31-10, thanks to the shared medium, Host B will receive the ARP request. But if Host B was located in another LAN, it would not be able to receive any ARP requests from Host A.[]

有趣的是,如果主机 A 想要与主机 C 通话并使用 RT1,会发生什么情况。根据 RT1,主机 C 可通过 RT2 到达。因此,RT1 向主机 A 发送 ICMP REDIRECT,提供 RT2 作为建议的新网关。稍后,当主机 A 要求 RT2 向主机 C 发送另一个数据包时,RT2 将检测到相同的次优路由条件,因此 RT2 将发送 ICMP REDIRECT,主机 A 最终将意识到它可以直接到达主机 C。简而言之,通过外部重定向解决共享媒体连接的过程是迭代的。

It is interesting to see what happens if Host A wants to talk to Host C and uses RT1. According to RT1, Host C is reachable via RT2. So RT1 sends an ICMP REDIRECT to Host A, providing RT2 as the suggested new gateway. RT2 will detect the same suboptimal routing condition later when it is asked by Host A to send another packet to Host C, so RT2 will send an ICMP REDIRECT and Host A will finally realize that it can reach Host C directly. In short, the process of resolving shared media connections through foreign redirects is iterative.
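This iterative convergence can be modeled with a toy simulation. The host and router names below are made up, and real redirects of course arrive as ICMP messages rather than function calls; the sketch only captures the "follow each redirect until no router suggests a better hop" loop.

```python
def resolve_next_hop(host_table, router_tables, dst, shared_lan):
    """Follow successive (foreign) redirects until no better hop is suggested."""
    hop = host_table[dst]
    while hop in router_tables:
        # The router's own next hop toward dst; dst itself if directly connected.
        suggestion = router_tables[hop].get(dst, dst)
        if suggestion == hop or suggestion not in shared_lan:
            break                      # no shortcut, or new hop unreachable at L2
        host_table[dst] = suggestion   # accept the redirect
        hop = suggestion
    return hop

# The Host A / Host C scenario: A first uses RT1, which redirects to RT2,
# which in turn redirects to Host C directly.
host_a = {"hostC": "RT1"}
routers = {"RT1": {"hostC": "RT2"}, "RT2": {}}
lan = {"RT1", "RT2", "hostC"}
```

Running `resolve_next_hop(host_a, routers, "hostC", lan)` walks RT1 → RT2 → hostC, mirroring the two-redirect sequence described in the text.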

在本节中,您了解了配置连接到不同 IP 子网上的同一共享介质的主机的含义。RFC 1620 更详细地介绍了该主题,非常值得阅读。在接下来的两节中,您将看到用户如何显式配置要连接到共享介质的接口,以影响入口 ICMP REDIRECT 消息的传输和处理,以便处理类似于本节中的场景适当地。

In this section, you have seen the implications of configuring the hosts connected to the same shared medium on different IP subnets. RFC 1620 goes into more detail on the subject and is well worth reading. In the next two sections, you will see how the user can explicitly configure an interface to be connected to a shared medium, to influence the transmission and the processing of ingress ICMP REDIRECT messages so that scenarios like the one in this section are taken care of properly.

传输 ICMP_REDIRECT 消息

Transmitting ICMP_REDIRECT Messages

当路由子系统已经发现入口和出口设备相同时，再检查共享介质配置似乎是多余的。事实上，这项检查并不多余。使用上一节描述的 ICMP REDIRECT 消息所产生的转发快捷方式可能并不总是可取的。这些快捷方式允许主机绕过某些路由器，但系统管理员可能正是要用这些路由器对所有流量应用策略。一个例子是 RT1 和 RT2 为防火墙[ * ]或代理主机的情况。

It may seem superfluous to check the shared media configuration when the routing subsystem has already seen that the ingress and egress devices match. In fact, that check is not superfluous. The use of forwarding shortcuts, as a result of using the ICMP REDIRECT messages described in the previous section, may not always be desirable. The shortcuts allow hosts to bypass certain routers, but the system administrator might be using those routers to apply policies to all traffic. An example would be where RT1 and RT2 were firewalls[*] or proxy hosts.

图31-11显示了路由代码在路由需要转发的入口数据包时遵循的逻辑。该流程图中只有一部分我尚未描述:设备未配置为共享介质时进行的检查。在这种情况下,仅当通过查询路由表找到的下一跳属于与发送方相同的子网(由发送方的源 IP 地址标识)时,才会向发送方发送 ICMP 重定向消息。如果不是这样,发送者将无法(根据路由器的知识)到达新的下一跳。这正是为什么在同一子网上检查共享媒体之前检查网关的原因。

Figure 31-11 shows the logic that the routing code follows when it routes an ingress packet that needs to be forwarded. There is only one part of the flowchart that I have not described yet: the check made when the device is not configured as a shared medium. In that case, the sender is sent an ICMP REDIRECT message only if the next hop found by consulting the routing table belongs to the same subnet as the sender (which is identified by the sender's source IP address). When this is not true, the sender would not (according to the router's knowledge) be able to reach the new next hop. And this is exactly why the check for shared media precedes the check for a gateway on the same subnet.
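The decision in Figure 31-11 can be condensed into a predicate like the following. It is a deliberate simplification — the real forwarding path also checks source routing, the (now defunct) NAT/MASQ case, and per-device sysctls — and the parameter names are invented for illustration.

```python
import ipaddress

def should_send_redirect(in_dev, out_dev, shared_media,
                         sender, next_hop, sender_subnet):
    """Decide whether forwarding this packet should trigger an ICMP REDIRECT."""
    if in_dev != out_dev:
        return False    # no shortcut: the packet leaves via another device
    if shared_media:
        return True     # shared medium: suggest the shortcut anyway
    # Non-shared medium: redirect only if the new next hop sits on the
    # sender's own subnet; otherwise the sender could not reach it.
    net = ipaddress.IPv4Network(sender_subnet)
    return (ipaddress.IPv4Address(sender) in net and
            ipaddress.IPv4Address(next_hop) in net)
```

Note how the `shared_media` test comes before the same-subnet test, exactly the ordering the text explains.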

请注意,由于内核 2.6 中已删除对快速 NAT 的支持(请参阅第 32 章中的“最近删除的选项” 部分),因此对 NAT/MASQ 的检查永远不会成功。

Note that since support for Fast NAT has been dropped in kernel 2.6 (see the section "Recently Dropped Options" in Chapter 32), the check on NAT/MASQ is never successful.

生成 ICMP REDIRECT 所需的条件

图 31-11。生成 ICMP REDIRECT 所需的条件

Figure 31-11. Conditions needed to generate an ICMP REDIRECT

有关 ip_forward 例程如何决定是否传输 ICMP REDIRECT 消息的详细信息，请参见第 20 章和第 35 章中的"转发"部分。另外，有关负责传输的 ICMP 代码的更多详细信息，请参阅第 25 章。

See Chapter 20 and the section "Forwarding" in Chapter 35 for details on how the ip_forward routine decides whether to transmit ICMP REDIRECT messages. Also, see Chapter 25 for more details on the ICMP code that takes care of transmission.

处理入口 ICMP_REDIRECT 消息

Processing Ingress ICMP_REDIRECT Messages

为了接受入口 ICMP 重定向,它需要通过一些健全性检查并符合用户配置。图 31-12用流程图显示了所有逻辑。我们将一点一点地讨论它。

For an ingress ICMP REDIRECT to be accepted, it needs to pass a few sanity checks and comply with the user configuration. Figure 31-12 shows all the logic with a flowchart. We'll go over it piece by piece.

处理入口 ICMP 重定向所需的条件

图 31-12。处理入口 ICMP 重定向所需的条件

Figure 31-12. Conditions needed to process an ingress ICMP REDIRECT

首先要通过一些基本的健全性检查,以及要遵守的一个用户配置:

First come a few basic sanity checks to pass, and one user configuration to comply with:

  • ICMP REDIRECT 通告的新网关应与当前网关不同(否则,无需 ICMP REDIRECT)。

  • The new gateway advertised by the ICMP REDIRECT should be different from the current one (otherwise, there is no need for an ICMP REDIRECT).

  • 新网关 IP 地址不能为多播、无效 (zeronet) 或保留。

  • The new gateway IP address cannot be Multicast, Invalid (zeronet), or Reserved.

  • 接收接口必须配置为接受入口 ICMP 重定向。

  • The receiving interface must be configured to accept ingress ICMP REDIRECTs.

其次,路由子系统必须考虑共享媒体配置:

Second, the routing subsystem must take into account the shared media configuration:

如果设备未配置为共享介质
If the device is not configured as a shared medium

在这种情况下,主机只有在再次根据路由表知道该网关与旧网关位于同一子网上(即直接连接到主机)时才可以接受新网关。

In this case, the host can accept a new gateway only if it knows, once again according to the routing table, that the gateway resides on the same subnet as the old one (i.e., it is directly connected to the host).

如果设备配置为共享介质
If the device is configured as a shared medium

只要主机根据路由表知道如何到达新网关,就会接受新网关。执行另外两项健全性检查:网关的地址不能是主机的本地地址,并且不能是广播地址。

The new gateway is accepted as long as, according to the routing table, the host knows how to reach it. Two other sanity checks are performed: the gateway's address must not be local to the host, and must not be a broadcast address.

正如“共享媒体”部分中提到的,可以配置设备,使其仅在新网关已知为网关时才接受重定向。

As mentioned in the section "Shared Media," it is possible to configure a device so that it accepts redirects only when the new gateway is already known to it as a gateway.
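The acceptance logic of Figure 31-12 can likewise be compressed into a predicate. The parameter names are invented for illustration, and several details (e.g., the broadcast and local-address checks) are folded into flags the caller is assumed to have computed from the routing table.

```python
import ipaddress

def accept_ingress_redirect(new_gw, cur_gw, dev_accepts_redirects,
                            shared_media, secure_only, known_gateways,
                            on_connected_subnet):
    """Simplified acceptance test for an ingress ICMP REDIRECT."""
    gw = ipaddress.IPv4Address(new_gw)
    if new_gw == cur_gw:
        return False    # nothing would change
    if gw.is_multicast or gw.is_reserved or int(gw) >> 24 == 0:
        return False    # bad gateway address (multicast/reserved/zeronet)
    if not dev_accepts_redirects:
        return False    # per-device policy forbids ingress redirects
    if secure_only and new_gw not in known_gateways:
        return False    # accept only "secure" redirects
    if not shared_media and not on_connected_subnet:
        return False    # new hop must be directly reachable
    return True
```

With `secure_only` set, a redirect pointing at a previously unknown gateway is rejected even if every other check passes.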

基于本节描述的逻辑处理入口 ICMP REDIRECT 消息的 ip_rt_redirect 函数将在第 25 章中进行分析。

The ip_rt_redirect function that processes ingress ICMP REDIRECT messages, based on the logic described in this section, is analyzed in Chapter 25.

反向路径过滤

Reverse Path Filtering

我们在第 30 章的“路由的基本要素”一节中了解了什么是非对称路由。非对称路由并不常见,但在某些情况下可能是必要的。Linux 的默认行为是认为非对称路由可疑,因此根据路由表丢弃其源 IP 地址无法通过接收数据包的设备到达的任何数据包。然而,这个行为可以通过/proc在每个设备的基础上进行调整,我们将在第 36 章中看到。另请参见第 35 章中的“输入路由”部分。

We saw what an asymmetric route is in the section "Essential Elements of Routing" in Chapter 30. Asymmetric routes are not common, but may be necessary in certain cases. The default behavior of Linux is to consider asymmetric routing suspicious and therefore to drop any packet whose source IP address is not reachable through the device the packet was received from, according to the routing table. However, this behavior can be tuned via /proc on a per-device basis, as we will see in Chapter 36. See also the section "Input Routing" in Chapter 35.

第30章的图30-7(a)和(b)中,我们看到了一个恶意用户使用同一子网内另一台主机的源IP地址发送ICMP ECHO REQUEST消息的示例。图31-13显示了另一个恶意用户,这次使用目标子网中的一个地址作为其源IP地址(例如,其受害者)。如图31-13(a) 所示,Linux 默认情况下会检测并丢弃此尝试。图 31-13(b)显示了如果路由器 RT 没有丢弃 ICMP ECHO REQUEST 消息,将会发生什么情况。

In Figure 30-7(a) and (b) in Chapter 30, we saw an example of a malicious user sending ICMP ECHO REQUEST messages with the source IP address of another host within the same subnet. Figure 31-13 shows another malicious user, this time using as its source IP address (e.g., its victim) an address in the target subnet. As Figure 31-13(a) shows, this attempt is detected and dropped by Linux by default. Figure 31-13(b) shows what would have happened if the ICMP ECHO REQUEST message was not dropped by the router RT.
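The check itself can be modeled with a longest-prefix-match lookup over a toy routing table. This is a sketch of the idea — route the packet's *source* address and compare the resulting egress device with the device the packet arrived on — not the kernel's fib code; the table entries are made up.

```python
import ipaddress

ROUTES = [                      # toy table: (prefix, egress device)
    ("10.0.0.0/24", "eth0"),
    ("10.0.1.0/24", "eth1"),
    ("0.0.0.0/0", "eth2"),      # default route
]

def egress_device(addr, table=ROUTES):
    """Longest-prefix-match lookup returning the device used to reach addr."""
    ip = ipaddress.IPv4Address(addr)
    best = None
    for prefix, dev in table:
        net = ipaddress.IPv4Network(prefix)
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, dev)
    return best[1] if best else None

def rp_filter_ok(src_ip, in_dev, table=ROUTES):
    """Reverse path filter: accept only if src would be routed back via in_dev."""
    return egress_device(src_ip, table) == in_dev
```

A packet claiming source 10.0.0.5 that arrives on eth2 fails the check, because the reverse route for that address points out eth0 — precisely the asymmetry the filter treats as suspicious.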

反向路径过滤示例

图 31-13。反向路径过滤示例

Figure 31-13. Example of reverse path filtering

该示例使用定向广播 ICMP 数据包,但反向路径过滤适用于任何类型的流量。

The example uses a directed broadcast ICMP packet, but reverse path filtering applies to any kind of traffic.




[ * ]有四种 ICMP REDIRECT 消息子类型。在本章中，我们仅关注 REDIRECT HOST，它用于重定向寻址到特定 IP 地址的流量。有关其他子类型，请参阅第 25 章。

[*] There are four ICMP REDIRECT message subtypes. In this chapter, we look only at REDIRECT HOST, used to redirect traffic addressed to a specific IP address. See Chapter 25 for the other subtypes.

[ * ]我们之前说过,IP 主机绝不会针对不属于其本地配置的子网之一的 IP 地址进行 ARP。请注意,此处描述的行为(即,主机对位于外部子网上的 IP 地址进行 ARP 解析)是可能的,因为重定向路由直接安装在路由缓存中,因此绕过了路由表(即,将使得无法对外部 IP 进行 ARP)。

[*] We said earlier that an IP host would never ARP for an IP address that does not belong to one of its locally configured subnets. Note that the behavior described here (i.e., a host that ARPs for an IP address located on a foreign subnet) is possible because the redirected route is installed directly in the routing cache, therefore bypassing the routing table (that is the one that would make it impossible to ARP a foreign IP).

[ ]有趣的是,IPv6 优化了 ICMP 重定向。在 IPv6 下,重定向包括建议的新网关的 L2 地址。因此,接收 ICMP REDIRECT 的主机无需解析 L3 到 L2 关联即可了解新网关的地址。

[] It is interesting to note that IPv6 has optimized ICMP redirects. Under IPv6, a redirect includes the L2 address of the suggested new gateway. So a host that receives an ICMP REDIRECT does not need to resolve the L3-to-L2 association to know the new gateway's address.

[ * ]在如图 31-10所示的场景中使用防火墙(所有主机共享 L2 广播域)无论如何都是一个糟糕的选择。但它可以让您了解为什么路由快捷方式并不总是可取的。

[*] The use of firewalls in scenarios like Figure 31-10, where all hosts share an L2 broadcast domain, would be a bad choice anyway. But it can give you an idea of why routing shortcuts are not always desirable.

第 32 章 路由：Linux 实现

Chapter 32. Routing: Linux Implementation

第30章概述了路由子系统的主要任务,第31章 介绍了IP在基本路由功能之上实现的策略路由和多路径等功能。在本章中,我介绍了路由代码使用的主要数据结构。然后我展示:

Chapter 30 provided an overview of the main tasks of the routing subsystem, and Chapter 31 introduced the features such as Policy Routing and Multipath that IP implements on top of the basic routing functionality. In this chapter, I introduce the main data structures used by the routing code. I then show:

  • 如何定义路由和 IP 地址的范围

  • How scopes are defined for routes and IP addresses

  • 路由子系统是如何初始化的

  • How the routing subsystem is initialized

  • 路由子系统需要收到哪些类型的外部事件通知以保持其路由信息更新

  • What kind of external events the routing subsystem needs to be notified of to keep its routing information updated

后面的章节将详细介绍路由缓存、路由表和路由查找。

Later chapters will go into detail on the routing cache, routing tables, and routing lookups.

内核选项

Kernel Options

正如我们将在本章的其余部分看到的,路由不仅仅涉及在一个接口上接收数据包、查询路由表以及将数据包转发出正确的传出接口。同时还有许多额外的任务需要处理。Linux 内核中已经实现了相当多有趣的与路由相关的功能。除了我们将在本章后面看到的那些之外,还有许多其他子系统正在等待 Linus 或其他子系统所有者的批准,以将其集成到内核中。

As we will see in the rest of this chapter, routing does not involve just receiving a packet on one interface, consulting the routing table, and forwarding the packet out of the right outgoing interface. There are a number of additional tasks to take care of at the same time. Quite a few interesting routing-related features have been implemented in the Linux kernel. In addition to those we will see later in this chapter, many others are waiting for the green light from Linus or owners of other subsystems to be integrated into the kernel.

这里我简单介绍一下Linux内核中影响路由代码行为的特性,以免大家在阅读源代码时感到困惑。每个功能将在本章和后续章节的专门章节中进一步描述。

Here I briefly introduce the features of the Linux kernel that influence the behavior of the routing code so that you will not suffer confusion when you peruse the source code. Each feature is further described in dedicated sections in this and later chapters.

路由选项可以分为两类:

Routing options can be classified into two categories:

  • 始终支持的那些,只需要由用户启用或配置,例如通过/proc

  • The ones that are always supported, and that only need to be enabled or configured by the user, such as via /proc.

  • 可以通过使用正确的选项重新编译内核来添加或删除其支持的选项。CONFIG_WAN_ROUTER 选项和 CONFIG_IP_ROUTE_MULTIPATH_CACHED 菜单下的选项可以编译为模块；其他选项必须编译进主内核。

  • The ones whose support may be added or removed by recompiling the kernel with the right options. The CONFIG_WAN_ROUTER option and the options under the CONFIG_IP_ROUTE_MULTIPATH_CACHED menu can be compiled as modules; the others must be compiled into the main kernel.

我们将在本章的其余部分和后续章节中讨论这两类选项,但在本节中,我们将从编译时选项的概述开始。这些选项可以从 Networking Options 菜单中进行配置,如图32-1中的make xconfig快照所示 。

We will look at both categories of options in the rest of this and the following chapters, but in this section we will start with an overview of the compile-time options. These options can be configured from the Networking Options menu, as shown in the make xconfig snapshot in Figure 32-1.

内核配置(make xconfig)

图 32-1。内核配置(make xconfig)

Figure 32-1. Kernel configuration (make xconfig)

以下两节在括号内给出了与每个选项关联的 CONFIG_XXX 内核符号。您可以使用该符号来识别仅当内核中包含对相应特性的支持时才会有条件执行的内核代码。但是，当某个特性独占整个文件时，您将不会在该文件中找到 CONFIG_XXX。

The following two sections include the CONFIG_XXX kernel symbol associated with each option within parentheses. You can use the symbol to identify the kernel code that is conditionally executed only when support for the feature is included in the kernel. When an entire file is used by a specific feature, though, you will not find CONFIG_XXX in the file.

基本选项

Basic Options

以下是一些基本的路由选项。本章不涉及其中任何一个。

Here are a few basic routing options. None of them is covered in this chapter.

IP:组播路由(CONFIG_IP_MROUTE)
IP: multicast routing (CONFIG_IP_MROUTE)

IP:PIM-SM 版本 1 支持 ( CONFIG_IP_PIMSM_V1)

IP:PIM-SM 版本 2 支持 ( CONFIG_IP_PIMSM_V2)

如果启用 IP 多播路由,则可以有选择地启用协议独立多播 (PIM) 协议的两个版本之一内核支持。本书不涉及组播路由。

IP: PIM-SM version 1 support (CONFIG_IP_PIMSM_V1)

IP: PIM-SM version 2 support (CONFIG_IP_PIMSM_V2)

If you enable IP multicast routing, you can then selectively enable either of the two versions of the Protocol Independent Multicast (PIM) protocol supported by the kernel. Multicast routing is not covered in this book.

广域网路由器 (CONFIG_WAN_ROUTER)
WAN router (CONFIG_WAN_ROUTER)

This option allows configuration of X.25, Frame Relay, HDLC, and other non-IP protocols to perform routing on WAN devices. In the kernel configuration menu, you can see the list of available drivers under "Network device support → Wan interfaces." To be able to configure WAN devices you need to download a piece of software that normally does not come by default with the most common Linux distributions.[*] WAN routing is not covered in this book.

Advanced Options

When you enable the "IP: Advanced router" option in the Networking Options menu, you can then enable a few additional features. In Chapter 31, we already introduced each one. Here is a brief overview:

IP: policy routing (CONFIG_IP_MULTIPLE_TABLES)

In some situations, traffic handling must be based on other criteria besides the destination IP address. Policy routing is one of the answers to this limitation. In these situations, the routing code must be enhanced to consider the additional parameters. See the section "Concepts Behind Policy Routing" in Chapter 31.

IP: use netfilter MARK value as routing key (CONFIG_IP_ROUTE_FWMARK)

When this option is enabled, routing table lookups can take into account a tag set by the firewall. See the section "Policy Routing and Firewall-Based Classifier" in Chapter 31. You can see this option if you first enable the "IP: policy routing" option.

IP: equal cost multipath (CONFIG_IP_ROUTE_MULTIPATH)

Sometimes a route can be defined with multiple next hops. In that case, distributing traffic over all the routes might increase overall bandwidth. That is exactly what this feature is about. In the section "Concepts Behind Multipath Routing" in Chapter 31, we saw that the implementation of this feature is not as trivial as it may look.

IP: equal cost multipath with caching support (CONFIG_IP_ROUTE_MULTIPATH_CACHED)

Add support for Multipath to the routing cache. This option can be selected only if the previous one is first enabled. When you select it, you get a submenu with a list of the available algorithms that can be used for the selection of the next hop from the cached routes. See the section "Cache Support for Multipath" in Chapter 31.

IP: verbose route monitoring (CONFIG_IP_ROUTE_VERBOSE)

There are a few places where weird conditions are detected, as when doing a sanity check on the traffic processed. In those cases, it can be useful to have some extra warning messages printed; that is the purpose of this feature. The output of those messages is rate limited to one every five seconds to avoid DoS attacks.

Recently Dropped Options

Following are a couple of features currently supported in the 2.4 kernel series, but not included in the 2.6 series:

IP: fast network translation (CONFIG_IP_ROUTE_NAT)

NAT is a feature typically configured on routers to modify the source or destination IP addresses of the forwarded IP packets according to a specific configuration. The NAT implemented by the routing code has nothing to do with the one implemented by the firewall code and has been determined to be superfluous. Its support was removed completely in kernel version 2.6.9.

Fast switching (CONFIG_NET_FASTROUTE)

This feature allows data traffic to be forwarded between NICs directly at the device driver layer. The packets are forwarded to the outgoing interface without having to pass through the higher layer (IP) and without any need to consult the routing table. In Figure 30-1 in Chapter 30, this feature is represented by the dotted line inside the driver box. The feature is currently supported by only one family of NICs, the Tulip cards.[*]

This feature, which was removed in kernel version 2.6.8, is not compatible with other important features for the simple reason that this low-level switching would bypass them. Examples of such features are the Netfilter firewall, advanced routing, and virtual devices (i.e., bonding).

Forwarding between high-speed interfaces (CONFIG_NET_HW_FLOWCONTROL)

This feature allows network cards to start and stop the kernel from sending them packets to transmit, based on the availability of buffer space in their memory. Not all network cards support it. An example of cards that support it is the Tulip family (drivers/net/tulip/*). Given the introduction of NAPI, described in Chapter 10, this interesting but almost unused feature has been dropped in kernel version 2.6.10.

Main Data Structures

The routing code uses a huge number of different data structures that reference each other. To understand the current routing code and any future improvements, it is important to see the relationships between them clearly.

Any code's performance can be significantly affected by the data structures used and the overall code design. This is especially true for kernel code. A kernel subsystem such as routing, which is the core of the network stack, therefore needs to make sure that it not only provides robust functionality, but also considers performance. We will see in the following chapters how the data structures listed in this section come together to make it easier to implement algorithms that are optimized from the point of view of CPU and RAM consumption, as well as caching.

The following list explains the main data structures defined and used by the routing code. The most important ones have dedicated sections with field-by-field descriptions in Chapter 36. The rt, fib, and fn prefixes in the data structures' names stand for route, Forwarding Information Base, and function, respectively.

struct ip_rt_acct

Used by the routing table based classifier (see the section "Routing Table Based Classifier" in Chapter 31) to keep statistics, expressed in both bytes and number of packets, about the traffic for routes that have been associated with a tag. The structure contains an array of counters, with 256 elements for each processor.[*] The size is 256 because route tags lie in the range from 0 to 255. The vector is allocated by ip_rt_init for IPv4; nothing is allocated for IPv6. The four fields of ip_rt_acct are updated in ip_rcv_finish. See the section "Policy Routing and Routing Table Based Classifier" in Chapter 35.

struct rt_cache_stat

Stores statistics about routing lookups. There is one instance of this data structure for each processor. Even though the name suggests that only counters related to the routing table cache are counted, a few instances are used for more general statistics about routing lookups. See the section "Statistics" in Chapter 36.

struct inet_peer

Maintains long-living information about remote IP peers. This data structure is described in the section "Long-Living IP Peer Information" in Chapter 23.

struct fib_result

Structure returned by a lookup in the routing table. The contents do not simply represent the next hop but also include some more parameters that are needed, for instance, by policy routing.

struct fib_rule

Represents the rules used by policy routing to select the routing table to use in routing traffic. See the section "Concepts Behind Policy Routing" in Chapter 31.

struct flowi

flowi is to some extent similar to an Access Control List (ACL): it defines an aggregate of traffic based on the value of selected L3 and L4 header fields, such as IP addresses, L4 port numbers, etc. It is used, for example, as a search key for routing lookups.

The following data structures are the building blocks of routing tables. Their relationships are described in greater detail in Chapter 34.

struct fib_node

A routing table entry; the data structure used to store the information generated, for example, when adding a route with the command route add or ip route add.

struct fn_zone

A zone represents a set of routes with the same netmask length. Because the netmask is a 32-bit value (for IPv4), there are 33 zones for each routing table. Thus, routes to the subnets 10.0.1.0/24 and 10.0.2.0/24 would go into the 24-bit zone list (the 25th zone), and routes to the subnet 10.0.3.128/25 would go into the 25-bit zone list.

struct fib_table

A routing table. Do not confuse it with the routing table cache.

struct fib_info

Some parameters can be shared between different routing table entries. These parameters are stored in fib_info data structures. When the set of parameters used by a new routing entry match those of an already existing entry, the existing fib_info structure is recycled. A reference count keeps track of the number of users. Figure 34-1 in Chapter 34 shows an example.

struct fib_alias

Routes that lead to the same destination network but differ with regard to other parameters, such as the TOS, are distinguished by means of fib_alias instances.

struct fib_nh

The next hop. If you define a route with a command such as ip route add 10.0.0.0/24 scope global nexthop via 192.168.1.1, the next hop will be 192.168.1.1. Normally there is only one next hop for a route, but when the multipath feature is compiled into the kernel, you can configure routes with more than one next hop. See the section "Concepts Behind Multipath Routing" in Chapter 31.

struct fn_hash

Contains the pointers to the heads of the 33 fn_zone lists, and a list that links together the active zones (the ones with at least one element). The elements of the latter are sorted by decreasing netmask length. See Figure 34-1 in Chapter 34.

The next block of structures is used by the protocol routing cache code and the protocol-independent routing cache code (DST), described in more detail in Chapter 33:

struct dst_entry

The protocol-independent part of the routing table cache entries (DST). Fields of the routing table cache entries that apply to any L3 protocols (e.g., IPv4, IPv6, DECnet) are put into this structure, which is then normally embedded in the data structures used by the L3 protocols to represent a routing table cache entry.

struct dst_ops

Virtual function table (VFT) used by the DST core code to notify the protocol of specific events (for instance, link failures). Each L3 protocol provides its own set of functions to handle those events in the way it prefers. Not all of the fields of the VFT are used by all of the protocols. See the section "Interface Between the DST and Calling Protocols" in Chapter 33.

struct rtable

Used by IPv4 to represent a routing table cache entry.[*]

The following structures are commonly used in configuration:

struct kern_rta

When the kernel receives a request to add or delete a route from an IPROUTE2 command in user space, it parses the request and stores it into a kern_rta structure. See the section "inet_rtm_newroute and inet_rtm_delroute functions" in Chapter 36.

struct rtentry

Used by the route command when sending the kernel a request to add or delete a route. IPROUTE2's ip route command uses a different data structure.

struct fib_iter_state

Stores context information used while browsing the data structure instances that compose a routing table. It is used by the code that handles the /proc interface.

The next block of data structures is used by the multipath caching feature, described in Chapter 33:

struct ip_mp_alg_ops

Multipath caching algorithm. It consists of function pointers.

struct multipath_device

Used by the device round-robin caching algorithm to keep information about a device.

struct multipath_candidate

struct multipath_dest

struct multipath_bucket

struct multipath_route

Used by the weighted random caching algorithm to keep the state information needed by the algorithm.

Lists and Hash Tables

Two other general-purpose data structures are used by the routing code. We will see them often in this part of the book, so they deserve a little introduction.

hlist_head

hlist_node

Hash tables are implemented with these two data structure types. The buckets of the table are defined as type hlist_head, and the actual elements that are added to the table embed an element of type hlist_node that is used to link them to the table. The only difference between the types is that hlist_head includes only a forward pointer, whereas hlist_node has both forward and back pointers.

Thus, the lists in hash table buckets are bidirectional. Because the head does not have a backward pointer, the list is not circular, so it is expensive to reach the tail of the list, but for a hash table this is not a problem. By leaving the backward pointer out of the bucket's head, this implementation reduces the size of the bucket by 50%, and therefore doubles the number of buckets it can store with the same amount of memory.

Figure 32-2 shows an example of a hash table built with hlist_head and hlist_node structures. Note that the hlist_node structure does not necessarily need to be the first field of the structure it links.

Not all hash tables defined by the routing code use the model in Figure 32-2: in Chapter 34, you will see a few examples of use involving the routing tables and the organization of its data structures; in Chapter 33, you will see that the routing cache uses its own definition instead.

Lists with two pointers in the head element can also be implemented, using list_head structures. You can find more details about both kinds of lists in include/linux/list.h. It includes definitions of the most common manipulation routines (add, remove, browse, etc.).

Route and Address Scopes

The concept of scope was introduced in the section "Scope" in Chapter 30. Let's see here how the scopes described in that section are defined by the kernel, and see some examples of their use.

Figure 32-2. Generic hash table and use of lists

The kernel defines an rt_scope_t enum that lists possible scopes in include/linux/rtnetlink.h. Its values range from 0 to 255, where 0 (RT_SCOPE_UNIVERSE) represents the broadest scope. The kernel actually uses only a few values. The others are left to the discretion of the user; at the moment, there are no practical uses for them.

Route Scopes

The scope of a route is saved in the fa_scope field of the fib_alias data structure (see Figure 34-1 in Chapter 34). Here are the main scopes used by the IPv4 routing code, in order of increasing scope:

RT_SCOPE_NOWHERE

This value, which was not listed in Chapter 30, is treated by the code as illegal. The literal meaning is that the route does not lead anywhere, which basically means there is no route to the destination.

RT_SCOPE_HOST

Examples of these routes are the ones created automatically when configuring IP addresses on the local interfaces (see the section "Adding an IP address").

RT_SCOPE_LINK

This includes routes to the local network (as defined by the netmask) and to the subnet broadcast addresses derived from locally configured addresses (see the section "Adding an IP address").

RT_SCOPE_UNIVERSE

This is used for all routes that lead to remote destinations not directly connected (i.e., the ones that require a next-hop gateway).

Address Scopes

The scope of an address is saved in the ifa_scope field of the in_ifaddr structure. There is an in_ifaddr instance for each IP address configured on a device (see Chapter 19). We saw examples of addresses for each main scope in Chapter 30.

The next-hop gateway in a route is another object type that is assigned a scope. Each route can be assigned zero, one, or multiple next hops. Each next hop is defined with an instance of a fib_nh structure (see Figure 34-1 in Chapter 34). Two of fib_nh's fields are nh_gw and nh_scope: nh_gw is the IP address of the next-hop gateway, and nh_scope is the scope of that address (which consists of the scope of the route needed to reach the next-hop gateway from the local host).

Relationship Between Route and Next-Hop Scopes

While the scope of a route and the scope of locally configured addresses are either explicitly set by the user or assigned a default value by the kernel, the scope of a route's next hop (fib_nh structure) is always assigned by the kernel.

In the section "Adding a Route" in Chapter 34, you will see how fn_hash_insert manages to insert a new route into a routing table. Here it suffices to say that fn_hash_insert uses fib_create_info to allocate the necessary data structures for the new route and to initialize the next hop's scope. This last task is taken care of by fib_create_info, with the help of fib_check_nh. The next hop's scope, nh_scope, is derived from the scope of the route being configured: normally, given a route and its next hop, the value assigned to nh_scope is the scope of the route that would be used to reach the next hop. There are special cases that require different rules, such as routes to locally configured addresses and other direct routes, which do not include a next hop.

Now that we know how nh_scope is initialized, let's see how its value can be used to enforce sanity checks on the routes.

The routing code enforces sanity checks on the scopes of routes and next hops in different places. Most of those sanity checks are based on an interesting property of the relationship between the scope of a route and the scope of its next hops. When a host forwards an IP packet, it is supposed to get closer to the final destination.[*] Based on this simple property, it follows that the scope of a route should always be greater than or equal to the scope of the next hop used by the route.

Let's see a couple of examples using the topology in Figure 32-3. Remember that for every route, nh_scope is the next hop's scope and nh_gw is the next hop's IP address.

  • When Host A sends a packet to Host C, the matching route has scope RT_SCOPE_UNIVERSE and the next hop to use is RT. The scope of the route to RT must be narrower than RT_SCOPE_UNIVERSE for the routing to converge. Thus, a lookup in the routing table for a route from Host A to Host RT will return a route with scope RT_SCOPE_LINK, which is a narrower scope than RT_SCOPE_UNIVERSE and therefore correct. Because an RT_SCOPE_LINK route does not need a gateway (and in fact you do not need a gateway for Host A to reach Host RT), the kernel initializes nh_gw to 0 and nh_scope to a scope narrower than RT_SCOPE_LINK (e.g., RT_SCOPE_HOST).

  • When Host A sends a packet to itself, the matching route has scope RT_SCOPE_HOST. In this case, you do not need a gateway, so nh_gw is set to 0. nh_scope is set to a scope smaller than RT_SCOPE_HOST: RT_SCOPE_NOWHERE.

The recursion just described ends when the result of a routing lookup is a direct route (i.e., no next-hop gateway is necessary). Here are the two possible cases:

The route returned by the routing lookup has scope RT_SCOPE_HOST

In this case, the destination is a locally configured address, so the host can deliver the packet locally.

The route returned by the routing lookup has scope RT_SCOPE_LINK

Because the destination is directly connected and there is no need for a gateway, the host can send the packet to the destination directly using an L2 protocol.[*]

Figure 32-3. Example of initialization of next hop's scopes

Primary and Secondary IP Addresses

We saw in the section "Primary and Secondary Addresses" in Chapter 30 that an IP address can be configured as primary or secondary on a device. The kernel often needs to browse all the addresses configured on a device to find one that matches a given condition. Let's see how the two types of addresses are distinguished and how addresses are browsed.

Secondary IPv4 addresses are tagged with the IFA_F_SECONDARY flag in their in_ifaddr data structures (see Chapter 19). Because there are only two configurations—primary and secondary—there is no need for an IFA_F_PRIMARY flag: if an address is not secondary, it is considered primary.

The kernel provides macros in include/linux/inetdevice.h that make it easier to browse interfaces meeting specific criteria. For each criterion selected, there are usually two macros that are used to bracket a loop: the programmer places the code to process a single address in a block that is started by one macro and terminated by another. The effect is to run a loop applying the code to each address meeting selected criteria.

Here is an example of the macros in use:

for_ifa {
 
        do something with ifa
 
} endfor_ifa

The for_ifa macro starts a loop with the variable ifa to represent each address selected. The code between the macros does not need to be placed in brackets, but it usually is, to make variables such as ifa local and usable only within the loop.

A few such macros include:

for_ifa

endfor_ifa

Given a device, these two macros can be used to browse all of its in_device data structures.[*]

for_primary_ifa

endfor_ifa

Given a device, these two macros can be used to selectively browse only the in_device instances associated with primary IP addresses.

Generic Helper Routines and Macros

The routing code uses quite a few small routines and macros that make the code more readable. This section lists some of the generic ones; more-specialized ones will be introduced later in the section "Helper Routines." It is important to keep in mind that the same function or macro can have different definitions, depending on such factors as:

  • Support for policy routing in the kernel

  • Support for multipath routing in the kernel

  • L3 protocol (e.g., IPv4 versus DECnet)

The generic routines follow:

FIB_RES_ XXX

Given a fib_result structure, these macros extract specific fields. For example, FIB_RES_DEV extracts the nh_dev field. These macros are defined in include/net/ip_fib.h.

change_nexthops

for_nexthops

endfor_nexthops

Used to browse all the fib_nh structures of a given fib_info instance. change_nexthops starts a loop over the structures, designating the local variable nh to represent each structure; as the name of the macro suggests, it can be used to alter the structures. for_nexthops is very similar and ends with the same endfor_nexthops macro. The only difference is that for_nexthops defines the local variable nh as a pointer to a constant and therefore the code inside the loop cannot change the content of any of the fib_nh instances browsed.

For IPv4, these macros are defined in net/ipv4/fib_semantics.c. Note that for each macro there are two versions: one used when there is Policy Routing support in the kernel and one when there is no Policy Routing. The second one is optimized by taking into account that without Policy Routing you always have at most one fib_nh instance per fib_info instance (that is, at most one next hop per route).

inet_ifa_byprefix

Given a device, a prefix, and a mask, this function browses all the primary IP addresses configured on the input device looking for an address that matches the input prefix and mask. In case of success, it returns the address that matches.

fib_get_table

Given a routing table identifier (a number from 0 to 255), this function returns the associated fib_table structure from the fib_tables array shown in Figure 34-1 in Chapter 34. It is defined in include/net/ip_fib.h.

fib_new_table

This function creates and initializes a new routing table and links it to the fib_tables vector (see Figure 34-1 in Chapter 34).

LOOPBACK

ZERONET

MULTICAST

LOCAL_MCAST/BADCLASS

These macros, defined in include/linux/in.h, are used to quickly classify some well-known categories of IP addresses. See Tables 30-1 and 30-2 in Chapter 30.

LOOPBACK identifies 127.x.x.x addresses.

ZERONET identifies 0.x.x.x/8 addresses, which in most cases are not legal.

MULTICAST identifies addresses in the class D range.

LOCAL_MCAST identifies a subset of the class D range used for local multicast: 224.0.0.0/24.

BADCLASS identifies addresses in the class E range.

Global Locks

The routing code uses a few locks for protection against race conditions. The following list includes only global locks; those that are embedded in the data structure (i.e., applied to single entries) will be addressed in the associated data structure descriptions.

fib_hash_lock

This read-write spin lock (rwlock) protects all the routing tables. For instance, the insertion of a new fib_node instance requires the lock to be taken in exclusive mode, and a routing table lookup requires the lock to be acquired just in shared mode. Since there is only one lock for all the routing tables, it means that it is not possible to add two routing entries to two distinct routing tables at the same time. However, this does not really represent a bottleneck, because configuration changes are rare events and the user can live with a shared lock without any major impact on router performance.

fib_info_lock

This rwlock protects all the fib_info data structures. It is used, for instance, when accessing fib_info structures through the hash tables described in the section "Organization of fib_info Structures" in Chapter 34.

fib_rules_lock

This rwlock protects the fib_rules global list of fib_rule data structures.

rt_flush_lock

This spin lock is used by rt_cache_flush to protect the manipulation of the rt_deadline global variable and the rt_flush_timer timer. The cache is protected by the per-bucket locks. See Figure 33-1 and the section "Flushing the Routing Cache," both in Chapter 33.

fib_multipath_lock

This spin lock is used when modifying fields of the fib_info structure that are used by the multipath feature.

alg_table_lock

This spin lock serializes access to the ip_mp_alg_table vector. It is used by the multipath_alg_register and multipath_alg_unregister functions. See the section "Registering a Caching Algorithm" in Chapter 33.

Routing Subsystem Initialization

The initialization of the IPv4 routing code starts in net/ipv4/route.c with ip_rt_init, which is called by the IP subsystem when it is initialized with ip_init in net/ipv4/ip_output.c at boot time. ip_init is described in Chapter 19; here we will see ip_rt_init. [*] Figure 32-4 shows how the main routing initialization routines are invoked.

Figure 32-4. Sequence of calls for the main routing initialization functions

In ip_rt_init, the IPv4 routing code initializes its data structures and global variables, and invokes the other main initialization routines shown in Figure 32-4.

Two routines are of particular interest:

ip_fib_init

Initializes the default routing tables and registers two handlers with the two notification chains[*] netdev_chain and inetaddr_chain (see the section "External Events").

devinet_init

Registers another handler with the notification chain netdev_chain, registers the handlers for the address and route commands (i.e., ip addr ... and ip route ...) with the Netlink socket, and creates the /proc/sys/net/ipv4/conf and /proc/sys/net/ipv4/conf/default directories. See Chapter 36 for the last two tasks.

When the kernel is compiled with support for IPsec, ip_rt_init also invokes a couple of IPsec initialization routines (xfrm_init and xfrm4_init).

See the section "Routing Cache Initialization" in Chapter 33 for the details on how the rt_hash_ xxx global variables are initialized by ip_rt_init.

Policy routing is initialized with fib_rules_init, defined in net/ipv4/fib_rules.c. The initialization consists simply of registering a handler with the netdev_chain notification chain. The registered handler is fib_rules_event, and is described in the section "Impacts on the policy database."

External Events

The routing subsystem plays a central role in the network stack. Because of this, it needs to know when changes take place that may affect the routing table and routing cache. Changes to the network topology are taken care of by optional routing protocols running in user space. On the other hand, changes to the local host configuration require kernel attention.

In particular, the routing subsystem is interested in two kinds of events:

  • Changes in the status of a network device

  • Changes in IP configuration on a network device

To receive notifications when these take place, the routing subsystem registers with the netdev_chain and inetaddr_chain notification chains, respectively. The sections "Changes in Device Status" and "Changes in IP Configuration" go into more detail on the handlers registered for the two classes of events.

Figure 32-5 shows a high-level description of the two handlers registered in ip_rt_init and described in the sections "Impacts on the routing tables" and "Impacts on the IP configuration." Some of the events are handled by calling certain helper routines with varying input parameters. Some of those routines are described in the upcoming section "Helper Routines." See the description of fib_sync_down in that section for the meaning of the force parameter shown in Figure 32-5.

We will see that a variety of events can flush the routing cache. Refer to the section "Flushing the Routing Cache" in Chapter 33 for a complete list of such events.

Helper Routines

In the following sections, we will see in detail how fib_netdev_event and fib_inetaddr_event are implemented. This section gives an overview of some of the routines called by those two handlers; you can use this section as a reference when reading about the handlers themselves.

void rt_cache_flush(int delay)

Schedules a flush of the routing cache after a given amount of time, which is specified with an input parameter. See the section "Flushing the Routing Cache" in Chapter 33.

int fib_sync_down(u32 local, struct net_device *dev, int force)

Updates the routing tables when a device is shut down or a local address is removed. Here is the meaning of the input parameters:

local

IP address that has been removed.

dev

Device that has been shut down.

force

Determines when certain activities are performed. Refer to Figure 32-5 to see when each of the following values is used. The meanings are as follows.

0: An IP address has been deleted.

1: A device has been shut down.

2: A device has been unregistered.

The force parameter is overloaded; it is used to decide two things:

Figure 32-5. fib_netdev_event and fib_inetaddr_event functions

The scope of the routes to delete

When force is 0, the handler deletes all eligible routes except for the ones that lead to locally configured addresses (i.e., scope RT_SCOPE_HOST). When force is 1 or 2, the handler deletes all eligible routes regardless of the scope.

How to handle multipath routes

When force is 2, the handler deletes a multipath route if at least one of its next hops uses the input device dev. When force is 0 or 1, the handler deletes a multipath route only if all the next hops are dead.

fib_sync_down is usually called to handle only one type of event at a time, so either the dev or local argument is set and the other is null.

When local is provided, fib_sync_down removes all the routes that use local as their preferred source address. Remember that routes can be assigned a preferred address, and not necessarily one configured on the route's egress device.

When dev is provided, fib_sync_down removes all the routes whose next hop is reachable via dev.

In both cases, routes are not removed directly; they are just marked dead (not usable) by setting the RTNH_F_DEAD flag. A multipath route is marked dead only when all of its next hops are marked as such. Also, when a next hop of a multipath route is marked dead, the parameters fib_power and nh_power have to be updated as well, to reflect the status of the current next hop (see the section "Next Hop Selection" in Chapter 35).

The return value is the number of fib_info structures marked dead by fib_sync_down. This value is used, for instance, by the caller (such as fib_disable_ip, described later in this section) to decide whether to flush the routing cache.

int fib_sync_up(struct net_device *dev)

This routine is used only when the kernel has support for multipath. Its main job is to update some of the route's parameters in the fib_info structure when some of the route's next hops are alive. The return value is the number of fib_info structures whose RTNH_F_DEAD flag has been cleared.

void fib_flush(void)

Scans the ip_fib_main_table and ip_fib_local_table routing tables and deletes all the fib_info structures that have their RTNH_F_DEAD flags set. It removes both the fib_info structure and the associated fib_alias structure. When there are no more fib_alias structures for a fib_node instance, the latter is also removed. See Table 34-1 in Chapter 34 for the default routine invoked by fib_flush, and Figure 34-1 in Chapter 34 for the relationships between the aforementioned data structures.

When there is support for multipath in the kernel, fib_flush scans all the routing tables.

When at least one fib_info instance is removed, the routing cache is then flushed with rt_cache_flush.

The return value is the number of fib_info instances removed.

static void fib_disable_ip(struct net_device *dev, int force)

Disables the IP protocol on the device received in input by calling fib_sync_down. When the number of deleted routes is positive (as determined from the return value of fib_sync_down), the function also flushes the routing table immediately: fib_sync_down marks routes as dead, and fib_flush actually removes these routes. fib_disable_ip also flushes the routing cache immediately and asks ARP to clear from its cache all the entries that refer to the device where the IP protocol is being shut down.

Note that the input parameter force is passed as it is to fib_sync_down. Figure 32-6 shows the internals of fib_disable_ip.

Figure 32-6. fib_disable_ip function

Changes in IP Configuration

Whenever a device's IP configuration changes, the routing subsystem receives a notification and handles it by running fib_inetaddr_event. Figure 32-5 summarizes the actions triggered by all the possible events that can be conveyed by this notification chain. Here is how the main events are handled:

NETDEV_UP

A new IP address has been configured on a local device. The handler must add the necessary routes to the local_table routing table. The routine responsible for this is fib_add_ifaddr.

NETDEV_DOWN

An IP address has been removed from a local device. The handler must remove these routes that were added by the previous NETDEV_UP event. The routine responsible for this is fib_del_ifaddr.

As mentioned in the section "Special Routes" in Chapter 30, every time an IP address is configured on an interface, the kernel adds a set of special routes to a separate routing table named ip_fib_local_table. The routine that takes care of adding these special routes is fib_add_ifaddr, which does it by calling fib_magic for each new route. fib_magic is described in Chapter 36.

Most of the routines invoked during event handling were described earlier in the section "Helper Routines." The following subsections describe fib_add_ifaddr and fib_del_ifaddr.

Adding an IP address

The logic of fib_add_ifaddr is summarized in Figure 32-7. When this routine is notified about a new address on a device, the device may not necessarily be enabled. The choice about whether to add routes to that device depends on whether it is enabled. Let's first see what routes are derived from an IP address, and then which ones are added when the device is enabled or disabled. As a basic example, here are the possible routes pertaining to the IP address 10.0.1.1/24:

Route to the address 10.0.1.1/32

This is simply the route to the specified host address.

Route to the network address 10.0.1.0/24

This is derived from the IP address and its netmask. In our example, it is the result of 10.0.1.1 & 255.255.255.0.

Routes to the broadcast addresses 10.0.1.255/32 and 10.0.1.0/32

This represents a compromise between what is mandated by the specification and what is most practical.

Linux is generous in handling the different requirements: it adds routes to both versions of the broadcast address in the ip_fib_local_table routing table. Note that routing can distinguish between the network address and limited broadcast address because they have different netmasks (10.0.1.0/24 versus 10.0.1.0/32). In addition, a user can configure a broadcast address explicitly. In this case, the fib_add_ifaddr routine adds a route to that address in addition to the other two.

Figure 32-7. fib_add_ifaddr function

The handling of broadcast addresses reflects a split between theory and practice. In theory, the broadcast address is supposed to be derived from the address class. In the case of our address 10.0.1.1/24, this would lead to an address of 10.255.255.255, and that is the default broadcast address assigned if you configure an address with ifconfig.

Usually, however, you want a broadcast address derived from the netmask. This is derived from the network address by setting to 1 all the bits in the host's component of the address, which in our case produces a broadcast address of 10.0.1.255. This more useful broadcast address is the default one assigned if you configure an address with ip addr.

This difference between the two solutions derives from the fact that 10.0.1.1 is an address in the class A network 10.0.0.0/8. Class A networks are commonly subnetted into smaller networks. For example, 10.0.1.0/24 is a class C-sized subnet of the class A network 10.0.0.0/8. If we had configured the address 192.168.1.1, both ifconfig and ip addr would have derived the same broadcast address 192.168.1.255, as 192.168.1.1 is an address in the class C network 192.168.1.0/24. See the section "Essential Elements of Routing" in Chapter 30 for more details.

We saw in the section "Responding from Multiple Interfaces" in Chapter 28 that IP addresses belong to the system, not to the interfaces on which they are configured. Because of that, the route to the IP address is always added to the ip_fib_local_table routing table regardless of the device status. However, the routes to the network identified by the address and the broadcast addresses are not: when the device is down, neither the network nor the broadcast addresses are reachable, so it would not be correct to create two routes for them. fib_add_ifaddr uses the device's IFF_UP flag to discover its status.

    fib_magic(RTM_NEWROUTE, RTN_LOCAL, addr, 32, prim);
    if (!(dev->flags&IFF_UP))
        return;

When you configure an IP address on a disabled device, you therefore add only the route to the IP address. When the device is later enabled, the fib_add_ifaddr routine will be called again and will add all the routes. It adds the route to the IP address again at this point, but this is not a problem because the routing table rejects duplicate routes.

Note that the command you use to configure an IP address on a device sometimes enables the device as well. For example, when you configure an IP address with ifconfig, you also enable the device. IPROUTE2 separates the two functions: you use ip addr add to configure an IP address and ip link set to enable or disable a device. It is important to understand these distinctions when you browse the source code and try to figure out how a given piece of kernel code behaves in response to input from a user-space command. The user-space commands are described in more detail in Chapter 36.

Table 32-1 shows some sample commands and the routes created.

Table 32-1. Examples of IP configurations and the associated derived routes

Command: ip addr add 10.0.1.1/24 dev eth0
  Main:  10.0.1.0/24
  Local: 10.0.1.1/32 (address), 10.0.1.0/32 (broadcast), 10.0.1.255/32 (broadcast)

Command: ip addr add 10.0.1.1/24 broadcast 10.0.1.100 dev eth0
  Main:  10.0.1.0/24
  Local: 10.0.1.1/32 (address), 10.0.1.100/32 (broadcast), 10.0.1.0/32 (broadcast), 10.0.1.255/32 (broadcast)

Command: ip addr add 10.0.1.1/32 dev eth0
  Main:  (none)
  Local: 10.0.1.1/32 (address)

Command: ip addr add 10.0.1.1/32 broadcast 10.0.1.255 dev eth0
  Main:  (none)
  Local: 10.0.1.1/32 (address), 10.0.1.255/32 (broadcast)

When the explicit broadcast happens to match the limited broadcast address 255.255.255.255, no route is added toward the explicit broadcast address, because the latter is checked by the lookup routines, as we will see in the sections "Input Routing" and "Output Routing" in Chapter 35.

    if (ifa->ifa_broadcast && ifa->ifa_broadcast != 0xFFFFFFFF)
        fib_magic(RTM_NEWROUTE, RTN_BROADCAST, ifa->ifa_broadcast, 32, prim);

The explicit configuration of a broadcast address allows you to define a broadcast address even on a /32 subnet where you theoretically have only one IP address, as in the fourth example in Table 32-1 (note that the broadcast address 10.0.1.255 does not fall within the subnet 10.0.1.1/32).

Under some conditions, the function may not need to add the routes to the broadcast addresses. These depend on the length of the netmask, which is stored in the local variable prefixlen:

  • When prefixlen is 32, there is only one valid address in the subnet, so there is no need for either the derived broadcast or the network routes.

  • When prefixlen is 31, there is only one bit to play with, so there are just two addresses within the subnet. The one with the clear bit identifies the network, and the one with the set bit is the host address (the one the function is configuring). In this case, routes are needed for these two addresses, but not for any derived broadcast addresses.

  • When prefixlen is smaller than 31, there is room for other addresses, because the local address together with the network and broadcast addresses use only three out of four or more addresses. Thus, the kernel adds a route to both the derived broadcast addresses and the network.

The following code shows how these cases are handled for primary addresses :

    if (!ZERONET(prefix) && !(ifa->ifa_flags&IFA_F_SECONDARY) &&
        (prefix != addr || ifa->ifa_prefixlen < 32)) {
        fib_magic(RTM_NEWROUTE, dev->flags&IFF_LOOPBACK ? RTN_LOCAL :
              RTN_UNICAST, prefix, ifa->ifa_prefixlen, prim);
        if (ifa->ifa_prefixlen < 31) {
            fib_magic(RTM_NEWROUTE, RTN_BROADCAST, prefix, 32, prim);
            fib_magic(RTM_NEWROUTE, RTN_BROADCAST, prefix|~mask, 32, prim);
        }
    }
}

Secondary addresses have none of these issues. When you add a secondary address, there must already be a primary address on the same subnet (prefix) configured on the same device. If there is no such primary address, you have made an error and the configuration cannot be accepted. Thus, routes to the network and to the derived broadcasts are not needed for secondary addresses: these routes were already added when the associated primary address was configured.

Removing an IP address

When you remove an IP address from an interface, the routing subsystem is notified so that it can clean up its routing tables and cache. The routine that takes care of this is fib_del_ifaddr, whose logic is described in Figure 32-8.

The routine starts with a sanity check. If you try to remove a secondary address, there must be a primary address on the same subnet. If there isn't, something must have gone wrong earlier somewhere and the routine returns an error.

When fib_del_ifaddr is invoked, the IP address whose associated route is being removed has already been removed from the list of configured addresses on the affected device (see, for example, when inet_del_ifa triggers the NETDEV_DOWN notification).

Because routes to broadcast addresses and the network address may not always have been added along with the primary address, as we saw in the previous section, fib_del_ifaddr scans all the configured addresses on the device and checks what needs to be removed. You can see, for example, in Table 32-1 what routes are added when configuring a local IP address.

In most cases, when a secondary address is removed, the routing subsystem needs to remove only the route to the IP address. The routes to the network and broadcast addresses are not removed because they are still needed by the primary address (and other secondary addresses, if any). However, it is possible that when removing a secondary IP address, it is not even necessary to remove the route to its IP address: this is the case, for example, when an administrator configures the same address with two different netmasks, as in the following example:

# ip addr add dev eth0 192.168.0.1/24
# ip addr add dev eth0 192.168.0.1/16

Figure 32-8. fib_del_ifaddr function

The example does not represent a common scenario, but the code must be able to handle it.

The routes to the network and broadcast addresses derived by the two commands (as described in the section "Adding an IP address") are different, but they share a route to the IP address.

After removing the routes that need to be removed, fib_del_ifaddr cleans up the routing table with fib_sync_down and fib_flush.

When fib_del_ifaddr removes the last IP address from a device, fib_inetaddr_event disables the IP protocol on that device with fib_disable_ip (see Figure 32-5).

Changes in Device Status

The routing subsystem registers three different handlers with the netdev_chain notification chain to handle changes in the status of a device:

  • fib_netdev_event updates the routing tables.

  • fib_rules_event updates the policy database, when policy routing is in effect.

  • ip_netdev_event updates the device's IP configuration.[*]

The next three sections describe how these routines handle the notifications they receive.

Impacts on the routing tables

Whenever a device changes state or something else in its configuration (besides the IP configuration, which is taken care of by another notification chain), the routing subsystem receives a notification and handles it by running fib_netdev_event.

Figure 32-5 summarizes the actions triggered by all the possible events that can be notified by this notification chain. Here is how the main events are handled:

NETDEV_UNREGISTER

When a device is unregistered, all the routes that use this device are removed from the routing tables (cache included). Multipath routes are also removed if at least one of the next hops uses this device.

NETDEV_DOWN

When a device goes down, all the routes that use this device are removed from the routing tables (cache included) with fib_disable_ip.

NETDEV_UP

When a device comes up, routing entries for all its IP addresses must be added to the local routing table ip_fib_local_table. This is accomplished by calling fib_add_ifaddr for each IP configured on the device. fib_add_ifaddr was described in the section "Adding an IP address."

NETDEV_CHANGEMTU

NETDEV_CHANGE

When a configuration change is applied to a device, the routing table cache is flushed. Among the most common notified changes are modifications of the MTU or the PROMISCUITY status.

Note that the routing subsystem is not interested in the NETDEV_REGISTER event. NETDEV_UP is already sufficient to trigger the necessary actions for a newly activated device.

Unregistering a device and shutting down a device can have different effects on the routing table. Some of the reasons a device can be unregistered include a user removing the driver from the kernel or unplugging a hotplug device such as a PCMCIA Ethernet card. Some of the reasons a device can be shut down include a user unplugging the cable or issuing an administrative command. In each case, different routes are removed from the routing tables.

Let's look at an example. The first column in Table 32-2 shows the routes that would be added to the routing table with the following two commands, and the last two columns show what routes would be removed when the device eth0 is shut down or unregistered, respectively.

# ip addr add dev eth0 192.168.1.100
# ip route add 10.0.1.0/24 via 192.168.1.111

Table 32-2. Routes dropped when a device is shut down or unregistered

Route              Routing table   Shut down   Unregistered
192.168.1.0/24     Main            Yes         Yes
192.168.1.0/32     Local           Yes         Yes
192.168.1.255/32   Local           Yes         Yes
192.168.1.100/32   Local           No          Yes
10.0.1.0/24        Main            Yes         Yes

The route to the 192.168.1.100/32 address is not removed when the device is shut down because the IP address belongs to the host, not to the interface. This address exists as long as its associated device exists. See the section "Responding from Multiple Interfaces" in Chapter 28.

Impacts on the policy database

A policy (i.e., a rule) can be associated with a device. You can specify, for instance, that traffic received on eth0 and addressed to the subnet 10.0.1.0/24 should be assigned a specific priority. Therefore, when a device is unregistered, all the associated policies (i.e., fib_rule data structures) are marked as unusable by setting their device ID, the r_ifindex field of the data structure, to the invalid value -1 with fib_rules_detach.

On the other hand, when a device is registered, if there is any disabled policy associated with this device it is re-enabled with fib_rules_attach. Because the device ID of disabled policies is -1, the kernel uses the device's name saved in fib_rule's r_ifname field to recognize the device with which a policy is associated.
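
The detach/re-attach logic described above can be sketched as follows. The field names r_ifindex and r_ifname come from the text; the function bodies are simplified stand-ins for fib_rules_detach and fib_rules_attach:

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for the kernel's fib_rule handling: a rule is bound
 * to a device by ID, disabled (ID set to -1) when the device unregisters,
 * and re-bound by name when a device with a matching name registers. */
struct fib_rule {
    int  r_ifindex;       /* device ID; -1 marks the rule as unusable */
    char r_ifname[16];    /* device name, kept so the rule can be re-bound */
};

static void rule_detach(struct fib_rule *r, int dead_ifindex)
{
    if (r->r_ifindex == dead_ifindex)
        r->r_ifindex = -1;   /* device unregistered: disable the rule */
}

static void rule_attach(struct fib_rule *r, const char *name, int new_ifindex)
{
    /* only disabled rules (ID == -1) are matched, and only by name */
    if (r->r_ifindex == -1 && strcmp(r->r_ifname, name) == 0)
        r->r_ifindex = new_ifindex;
}
```

Keeping the name alongside the ID is what makes the re-enable step possible: the new registration may assign the device a different ifindex, but the name survives the unplug/replug cycle.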

Impacts on the IP configuration

Here is how the handler's inetdev_event routine handles the notifications received from the netdev_chain chain:

NETDEV_UNREGISTER

Disables the IP protocol on the device.

NETDEV_UP

Enables the multicast configuration (if present) with ip_mc_up. When the device going up is the loopback device, the 127.0.0.1/8 address is configured on it.

This notification is ignored if the device going up has a configured MTU smaller than the minimum value of 68 that is necessary to enable the IP protocol. This is only a sanity check.

NETDEV_DOWN

Disables the multicast configuration (if present) with ip_mc_down.

NETDEV_CHANGEMTU

Checks whether the device's MTU has been set to a value smaller than the minimum necessary to run the IP protocol (68), and if so, disables the IP protocol on the device.

NETDEV_CHANGENAME

Updates the name of the directories /proc/sys/net/ipv4/conf/ devname and /proc/sys/net/ipv4/neigh/ devname to reflect the new device name. These directories are described in Chapters 23 and 29, respectively.

For both NETDEV_UNREGISTER and NETDEV_CHANGEMTU, the IP protocol is disabled with inetdev_destroy. That function removes all IP configurations from the device and clears the ARP cache accordingly with arp_ifdown.

Interactions with Other Subsystems

The section "Interactions with Other Subsystems" in Chapter 31 anticipated the main interactions that the routing subsystem has with other ones, such as Traffic Control and Firewall. In the following subsections, we will see some more details. The interaction with the routing table based classifier is deferred until Chapter 35 because it requires some background on the routing table structure and on how lookups are implemented.

Netlink Notifications

When a route is added or removed, a notification is sent to the Netlink group RTMGRP_IPV4_ROUTE using the routine rtmsg_fib. Notifications for creation and deletions are respectively generated in fn_hash_insert and fn_hash_delete, two important routines that we will see in Chapter 34. See also the section "Change Notifications" in Chapter 36.

Policy Routing and Firewall-Based Classifier

As anticipated in the section "Interactions with Other Kernel Subsystems" in Chapter 31, policy routing can use a tag initialized by the firewall code as a discriminator to decide which routing table to use for both ingress and egress traffic. Routing based on firewall tagging requires special support to be compiled into the kernel. When available, the firewall tag is part of the cache and routing table lookup keys (represented by flowi structures). The firewall subsystem copies the tag into the skb->nfmark buffer field, where it can be used as a discriminator by Policy Routing to decide which routing table to use to route ingress and egress traffic.

In Chapter 33, you will see how the two cache lookup routines ip_route_input and __ip_route_output_key check the value of the firewall tag. Chapter 34 shows how the two routing table lookup routines ip_route_input_slow and ip_route_output_slow initialize the nfmark field of the lookup key flowi with the firewall tag skb->nfmark. A 0 value for skb->nfmark means that no tag exists.

Figure 31-4 in Chapter 31 shows when, inside the network stack, the firewall can tag a buffer based on its configuration, and when the policy routing engine uses it for its policy rules lookup.

Routing Protocol Daemons

Routes can be added both by users with commands such as ip route or route and by routing protocols running in user space, such as BGP, IGRP, and OSPF. We saw the big picture in the section "Routing Protocol Daemons" in Chapter 31. In this section, we will go into a little more detail on the user/kernel interface, and Chapter 36 will go into more detail on the utilities themselves. However, we will not cover the internals of any routing protocol in detail because it is outside the scope of the book.

Routing protocols run in user space, but they need to inject their knowledge into the kernel to have their routes be incorporated into the kernel's routing tables. While the routing protocol code is independent from the underlying operating system, the way those protocols inject routes into the kernel has to adapt to the user/kernel interfaces provided by the underlying operating system.

If you like browsing source code, I urge you to look at how the platform-independent code (basically, the routing protocols) interact with the platform-dependent code to inject routes into the kernel's routing table. You will see, for instance, how different operating systems may require different interfaces. Even different versions of the same operating system may require or make available different interfaces.

With regard to Linux, the old-generation ioctl interface is still available, but the new Netlink is preferred by the kernel because it is more powerful. While ioctl is pretty common to all the Unix flavors, Netlink is Linux-specific and plays the same role in the Linux world that the routing socket plays in the BSD world. It is important to note that when Netlink is compiled into the kernel, it is preferred over ioctl because of its better control and bidirectional capabilities. (For instance, with Netlink—as with the routing socket on BSD—when the kernel detects changes on an NIC, it can communicate it to a user-space application over a Netlink socket so that the application can take some action.)

Table 32-3 lists the most common routing daemons and shows which ones can handle the ioctl interface and Netlink. Chapter 36 goes into more detail on these interfaces.

Table 32-3. Interfaces to the Linux kernel used by the most common routing daemons

Daemon             ioctl    Netlink
ROUTED (v0.17)     Yes      No
GATED (v3.6)       Yes      Yes
BIRD (v1.0.9)      Yes      Yes
ZEBRA (v0.94)      Yes      Yes
QUAGGA (v0.98.0)   Yes      Yes
MRT (v2.2.0)       Yes      Yes
XORP (v1.0)        No [a]   Yes

[a] XORP uses ioctl, but not to insert or delete routes. The purpose and operation of XORP are described in an interesting document, http://www.xorp.org/releases/current/docs/fea/fea.pdf.




[*] See the file Documentation/networking/wan-router.txt.

[*] There is a patch you can download and apply to the kernel that allows the Tulip 8390 card to use the "fast switching " feature on 2.4 kernels. A link to this patch is provided by the help window you can open when you enable this feature (e.g., with make xconfig).

[*] Do not get confused by the fact that the data structure and the array have the same name.

[*] IPv6 uses struct rt6_info.

[*] This does not necessarily mean physically closer. Sometimes, complex routing scenarios may need to force packets to go through specialized devices, which may require suboptimal routes. However, given the path that a packet is supposed to follow to go from source to destination, every system that forwards it must make it go one more hop toward its final destination.

[*] This includes the case of onlink routes, described in Chapter 33.

[*] We saw in Chapter 19 that in_device is the data structure used to store the IP configuration of a network device.

[*] IPv6 does something similar in inet6_init by calling ip6_route_init.

[*] Notification chains are described in Chapter 4.

[*] This handler is registered by ip_rt_init, but it actually belongs to the IP subsystem.

Chapter 33. Routing: The Routing Cache

The routing cache is used to reduce the lookup time on the routing tables. The center of the routing cache is the Protocol Independent Destination Cache, which is simply called DST . Even if policy routing is in effect—creating multiple routing tables—a single routing cache is always shared by all the routing tables.

The main job of the cache is to store information that allows the routing subsystem to find destinations for packets, and to offer this information through a set of functions to higher layers. The cache also offers some functions to manage cleanup. The cache stores the information about the routing table cache entries that applies to all L3 protocols and can therefore be included in any data structure used to represent a routing table cache entry.

In this chapter, we will see:

  • How the cache is implemented

  • How new elements are inserted and existing ones are deleted

  • How ingress and egress lookups are implemented, and where they differ

  • How external subsystems can interact with the cache via an interface provided by the DST

  • How different kinds of garbage collection keep the size of the cache under control

  • How the DST provides a rate-limiting mechanism for egress ICMP REDIRECT messages

Routing Cache Initialization

The routing cache is implemented as a hash table. It is initialized in ip_rt_init, which is the initialization function of the routing subsystem and is described in the section "Routing Subsystem Initialization" in Chapter 32.

The size of the cache depends on the amount of physical memory available in the host. On your system, you can find the size of the hash table in the messages printed at boot time, or later in the output of the dmesg command. Look for the string "IP: routing cache hash table of ...", which is printed by ip_rt_init itself. The size is stored in rt_hash_mask, and the base two logarithm of it is saved in rt_hash_log (that is, 2^rt_hash_log = rt_hash_mask). The default size assigned by the kernel can be overridden by the user boot option rhash_entries, which stores the hash table size to use in the variable rhash_entries.
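
The relationship between the table size and its base two logarithm can be made concrete with a small sketch. The power-of-two rounding below is an assumption for illustration, not the kernel's actual sizing heuristic:

```c
#include <assert.h>

/* Sketch: round a requested entry count up to a power of two and record
 * both the size and its base two logarithm, mirroring the role of the
 * rt_hash_mask / rt_hash_log pair described in the text
 * (2^rt_hash_log == rt_hash_mask). */
static unsigned long hash_size;  /* plays the role of rt_hash_mask */
static unsigned int  hash_log;   /* plays the role of rt_hash_log  */

static void hash_table_init(unsigned long requested)
{
    hash_log = 0;
    while ((1UL << hash_log) < requested)
        hash_log++;              /* smallest power of two >= requested */
    hash_size = 1UL << hash_log;
}
```

Keeping the logarithm around is handy whenever a value must be shifted by that number of bits instead of divided by the table size.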

In particular, ip_rt_init initializes the following:

rt_hash_table

The routing cache, defined as a hash table.

rt_hash_mask

rt_hash_log

The size (number of buckets) of the hash table rt_hash_table, and the base two logarithm of that number, which often is useful when a value has to be shifted by that number of bits.

rt_hash_rnd

A parameter that is assigned a new random value every time the routing cache is flushed with rt_run_flush. This parameter is used to prevent DoS attacks, as part of an algorithm that distributes elements in the routing cache to make their distribution less deterministic. This variable is first initialized by ip_rt_init based on parameters related to available memory and the current jiffies. Later, after the system has been up for a while and there is a chance for it to build up good entropy, the variable is reset using the get_random_bytes routine.
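
The idea of seeding the bucket hash with a random value can be sketched as follows. The mixing function is invented for illustration and is not the kernel's hash; the point is only that the seed makes bucket placement unpredictable to a remote attacker:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a seeded bucket-selection hash: a random value (the role
 * rt_hash_rnd plays) is mixed into the hash of the lookup fields, so an
 * attacker cannot craft addresses that all collide in one bucket. */
static uint32_t hash_rnd;  /* re-randomized on every cache flush */

static uint32_t route_hash(uint32_t saddr, uint32_t daddr, uint8_t tos,
                           uint32_t mask)
{
    uint32_t h = saddr ^ daddr ^ tos ^ hash_rnd;  /* fold in the seed */
    h ^= h >> 16;
    h *= 0x45d9f3b;   /* arbitrary odd multiplier, purely illustrative */
    h ^= h >> 16;
    return h & mask;  /* mask selects a bucket in a power-of-two table */
}
```

For a fixed seed the function stays deterministic, which is all a cache needs; changing the seed at flush time merely reshuffles which bucket each flow lands in.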

Hash Table Organization

The data structures described in this section vary slightly among L3 protocols. In IPv4, hash table buckets are of type rt_hash_bucket, a structure that includes only a pointer to the list of colliding elements and a lock. The use of the lock is described in the section "Cache Locking."

Elements of the cache are of type rtable. This structure includes some protocol-dependent fields, described in the section "rtable Structure" in Chapter 36, and a protocol-independent data structure of type dst_entry, shown in Figure 33-1. The dst_entry structure includes the interface to the neighboring layer and its cache, transformers (such as IPsec), and routing cache management. The section "dst_entry Structure" in Chapter 36 describes the data structure in detail, and Chapter 27 goes over the interface to the neighboring layer.

The first field of the rtable structure is a union; this makes it easy for the rtable and dst_entry structures to share values such as the pointer to the next colliding hash table entry. The names of the pointers differ (dst_entry uses next, whereas rtable uses rt_next), but they refer to the same memory location.

Figure 33-1. Routing cache structure

struct rtable
{
        union
        {
                struct dst_entry dst;
                struct rtable *rt_next;
        } u;
        ... ... ...
}

A pointer to an rtable structure can be safely typecast to a pointer to a dst_entry, and vice versa.
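
A few lines of code make the layout trick concrete. The structures below are stripped-down stand-ins for the real ones; only the placement of the union as the first field matters:

```c
#include <assert.h>
#include <stddef.h>

/* Because the union is the FIRST field of struct rtable, a pointer to the
 * rtable and a pointer to its embedded dst_entry refer to the same address,
 * and rt_next overlays dst.next. Simplified layout, for illustration only. */
struct dst_entry {
    struct dst_entry *next;   /* next colliding hash table entry */
};

struct rtable {
    union {
        struct dst_entry dst;
        struct rtable   *rt_next;  /* overlays dst.next */
    } u;
};
```

This is what makes the typecast in either direction safe: no pointer arithmetic is involved, only a reinterpretation of the same memory.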

When accessing the table for an insertion, deletion, or lookup, the routing subsystem selects the bucket of the table through a combination of the source and destination IP addresses, the TOS field, and the ingress or egress device. The ingress device ID is used when routing ingress traffic, and the egress device ID is used when routing egress traffic that is locally generated. However, while there is always a known ingress device for ingress traffic, the egress device may not yet be known for egress traffic. The egress device is known only after the routing lookup, unless the routing lookup key includes the egress device (which is possible for locally generated traffic, but not necessary).

Major Cache Operations

The protocol-independent (DST) part of the cache is a set of dst_entry data structures. Most of the activities in this chapter happen through a dst_entry structure. The IPv4 and IPv6 data structures rtable and rt6_info both include a dst_entry data structure.

The dst_entry structure offers a set of virtual functions in a field named dst_ops, which allows higher-layer protocols to run protocol-specific functions that manipulate the entries. The DST code is located in net/core/dst.c and include/net/dst.h.

All the routines that manipulate dst_entry structures start with a dst_ prefix. Note that even though they operate on dst_entry structures, they actually affect the outer rtable structures, too.

DST is initialized with dst_init, invoked at boot time by net_dev_init (see Chapter 5).

Cache Locking

Read-only operations, such as lookups , use a different locking mechanism from read-write operations such as insertion and deletion, but they naturally have to cooperate. Here is how they are handled:

Read-only operations

These use the routines presented in the section "Cache Lookup" and are protected by a read-copy-update (RCU) read lock, as in the following snapshot:

rcu_read_lock();
...
perform lookup
...
rcu_read_unlock();

This code actually does no locking, because read operations can proceed simultaneously without interfering with each other.

Read-write operations

The insertion of an entry (see the section "Adding Elements to the Cache") and the deletion of an entry (see the section "Deleting DST Entries") use the spin lock embedded in each bucket's element and shown in Figure 33-1. Note that the provision of a per-bucket lock lets different processors write simultaneously to different buckets.
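
The benefit of a per-bucket lock can be sketched in user space. An atomic flag stands in for the kernel spin lock, and the bucket layout is illustrative:

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of per-bucket locking: each bucket carries its own lock, so
 * writers touching different buckets never contend with each other.
 * (Static storage leaves the flags in the clear state on common
 * implementations; ATOMIC_FLAG_INIT is the portable initializer.) */
#define NBUCKETS 16

struct bucket {
    atomic_flag lock;   /* per-bucket spin lock */
    int entries;        /* stands in for the collision list */
};

static struct bucket table[NBUCKETS];

static void bucket_lock(struct bucket *b)
{
    while (atomic_flag_test_and_set(&b->lock))
        ;  /* spin until the flag is clear */
}

static void bucket_unlock(struct bucket *b)
{
    atomic_flag_clear(&b->lock);
}

static void bucket_insert(unsigned int hash)
{
    struct bucket *b = &table[hash % NBUCKETS];
    bucket_lock(b);      /* serializes writers on THIS bucket only */
    b->entries++;
    bucket_unlock(b);
}
```

A single global lock would serialize every insertion and deletion in the cache; one lock per bucket confines contention to flows that actually hash to the same chain.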

Chapter 1 explains the RCU algorithm used to implement locking in the routing table cache, and how read-write spin locks coexist with RCU.

Cache Entry Allocation and Reference Counts

A memory pool used to allocate new cache entries is created by ip_rt_init at boot time. Cache entries are allocated with dst_alloc, which returns a void pointer that is cast by the creator to the right data type. Despite the function's name, it does not allocate dst_entry structures, but instead allocates the larger entries that contain those structures: rtable structures for IPv4 (as shown in Figure 33-1), rt6_info for IPv6, and so on. Because the function can be called to allocate structures of different sizes for different protocols, the size of the structure to allocate is indicated through an entry_size virtual function, described in the section "Interface Between the DST and Calling Protocols."

Adding Elements to the Cache

Every time a cache lookup required to route an ingress or egress packet fails, the kernel consults the routing table and stores the result into the routing cache. The kernel allocates a new cache entry with dst_alloc, initializes some of its fields based on the results from the routing table, and finally calls rt_intern_hash to insert the new entry into the cache at the head of the bucket's list. A new route is also added to the cache upon receipt of an ICMP REDIRECT message (see Chapter 25). Figures 33-2(a) and 33-2(b) show the logic of rt_intern_hash. When the kernel is compiled with support for multipath caching, a cache miss may lead to the insertion of multiple routes into the cache, as discussed in the section "Multipath Caching."

The function first checks whether the new route already exists by issuing a simple cache lookup. Even though the function was called because a cache lookup failed, the route could have been added in the meantime by another CPU. If the lookup succeeds, the existing cached route is simply moved to the head of the bucket's list. (This assumes the route is not associated with a multipath route; i.e., that its DST_BALANCED flag is not set.) If the lookup fails, the new route is added to the cache.

As a simple way to keep the size of the cache under control, rt_intern_hash tries to remove an entry every time it adds a new one. Thus, while browsing the bucket's list, rt_intern_hash keeps track of the most eligible route for deletion and measures the length of the bucket's list. A route is removed only from those that are eligible for deletion (that is, routes whose reference counts are 0) and when the bucket list is longer than the configurable parameter ip_rt_gc_elasticity. If these conditions are met, rt_intern_hash invokes the rt_score routine to choose the best route to remove. rt_score ranks routes, according to many criteria, into three classes, ranging from most-valuable routes (least eligible to be removed) to least-valuable routes (most eligible to be removed):[*]

Figure 33-2a. rt_intern_hash function

  • Routes that were inserted via ICMP redirects, are being monitored by user-space commands, or are scheduled for expiration.

  • Output routes (the ones used to route locally generated packets), broadcast routes, multicast routes, and routes to local addresses (for packets generated by this host for itself).

  • All other routes in decreasing order of timestamp of last use: that is, least recently used routes are removed first.

rt_score simply stores the time the entry has not been used in the lower 30 bits of a local 32-bit variable, then sets the 31st bit for the second class of routes listed and the 32nd bit for the first class. The final value is a score that represents how important that route is considered to be: the lower the score, the more likely the route is to be selected as a victim by rt_intern_hash.
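
Under those rules, the scoring can be sketched as follows. To keep the three criteria consistent (least recently used removed first, lower score = better victim), the sketch stores the complement of the idle time in the low bits; names and the exact class-to-bit mapping are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Hedged sketch of the scoring scheme: the complement of the idle time
 * fills the low 30 bits (the longer a route sits unused, the LOWER it
 * scores), and the two top bits lift the two valuable classes of routes
 * above ordinary ones. */
static uint32_t rt_score_sketch(uint32_t idle, int semi_valuable,
                                int most_valuable)
{
    uint32_t score = (~idle) & ((1U << 30) - 1);  /* older -> lower score */
    if (semi_valuable)
        score |= 1U << 30;  /* 31st bit: output/broadcast/multicast/local */
    if (most_valuable)
        score |= 1U << 31;  /* 32nd bit: redirect-learned, monitored, or
                               expiring routes */
    return score;
}
```

With this encoding a single unsigned comparison implements the whole ranking: any route in a valuable class outranks every plain route, and within a class, age breaks ties.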

Figure 33-2b. rt_intern_hash function

Binding the Route Cache to the ARP Cache

Most routing cache entries are bound to the ARP cache entry of the route's next hop. This means that a routing cache entry requires either an existing ARP cache entry or a successful ARP lookup for the same next hop. In particular, the binding is done for output routes used to route locally generated packets (identified by a NULL ingress device identifier) and for unicast forwarding routes. In both cases, ARP is asked to resolve the next hop's L2 address. Forwarding to broadcast addresses, multicast addresses, and local host addresses does not require an ARP resolution because the addresses are resolved using other means.

Egress routes that lead to broadcast and multicast addresses do not need associated ARP entries, because the associated L2 addresses can be derived from the L3 addresses (see the section "Special Cases" in Chapter 26). Routes that lead to local addresses do not need ARP either, because packets matching the route are delivered locally.

ARP binding for routes is created by arp_bind_neighbour. When that function fails due to lack of memory, rt_intern_hash forces an aggressive garbage collection operation on the routing cache by calling rt_garbage_collect (see the section "Garbage Collection"). The aggressive garbage collection is done by lowering the thresholds ip_rt_gc_elasticity and ip_rt_gc_min_interval and then calling rt_garbage_collect. The garbage collection is tried only once, and only when rt_intern_hash has not been called from software interrupt context, because otherwise, it would be too costly in CPU time. Once garbage collection has completed, the insertion of the new cache entries starts over from the cache lookup step.

Cache Lookup

Anytime there is a need to find a route, the kernel consults the routing cache first and falls back to the routing table if there is a cache miss. The routing table lookup process is described in Chapter 35; in this section, we will look at the cache lookup.

The routing subsystem provides two different functions to do route lookups , one for ingress and one for egress:

ip_route_input

Used for input traffic, which could be either delivered locally or forwarded. The function determines how to handle generic packets (whether to deliver locally, forward, drop, etc.) but is also used by other subsystems to decide how to handle their ingress traffic. For instance, ARP uses this function to see whether an ARPOP_REQUEST should be answered (see Chapter 28).

ip_route_output_key

Used for output traffic, which is generated locally and could be either delivered locally or transmitted out.

Possible return values from the two routines include:

0

The routing lookup was successful. This case includes a cache miss that triggers a successful routing table lookup.

-ENOBUFS

The lookup failed due to a memory problem.

-ENODEV

The lookup key included a device identifier and it was invalid.

-EINVAL

Generic lookup failure.

The kernel also provides a set of wrappers around the two basic functions, used under specific conditions. See, for example, how TCP uses ip_route_connect and ip_route_newports.

Figure 33-3 shows the internals of the two main routing cache lookup routines. The egress function shown in the figure is __ip_route_output_key, which is indirectly called by ip_route_output_key.

Figure 33-3. (a) ip_route_input function; (b) __ip_route_output_key function

The routing cache is used to store both ingress and egress routes, so a cache lookup is tried in both cases. In case of a cache miss, the functions call ip_route_input_slow or ip_route_output_slow, which consult the routing tables via the fib_lookup routine that we will cover in Chapter 35. The names of the functions end in _slow to underline the difference in speed between a lookup that is satisfied from the cache and one that requires a query of the routing tables. The two paths are also referred to as the fast and slow paths.

Once the routing decision has been taken, through either a cache hit or a routing table lookup, and has resulted in either success or failure, the lookup routines return the input buffer skb with the skb->dst->input and skb->dst->output virtual functions initialized. skb->dst is the cache entry that satisfied the routing request; in case of a cache miss, a new cache entry is created and linked to skb->dst.

The packet will then be further processed by calling either one or both of the virtual functions skb->dst->input (called via a simple wrapper named dst_input) and skb->dst->output (called via a wrapper named dst_output). Figure 18-1 in Chapter 18 shows where those two virtual functions are invoked in the IP stack, and what routines they can be initialized to depending on the direction of the traffic.

Chapter 35 goes into detail on the slow routines for the routing table lookups. The next two sections describe the internals of the two cache lookup routines in Figure 33-3. Their code is very similar; the only differences are:

  • On ingress, the device of the ingress route needs to match the ingress device, whereas the egress device is not yet known and is therefore simply compared against the null device (0). The opposite applies to egress routes.

  • In case of a cache hit, the functions update the in_hit and out_hit counters, respectively, using the RT_CACHE_STAT_INC macro. Statistics related to both the routing cache and the routing tables are described in Chapter 36.

  • Egress lookups need to take the RTO_ONLINK flag into account (see the section "Egress lookup").

  • Egress lookups support multipath caching, the feature introduced in the section "Cache Support for Multipath" in Chapter 31.

Ingress lookup

ip_route_input is used to route ingress packets. Here is its prototype and the meaning of its input parameters:

int ip_route_input(struct sk_buff *skb, u32 daddr, u32 saddr,
           u8 tos, struct net_device *dev)
skb

Packet that triggered the route lookup. This packet does not necessarily have to be routed itself. For example, ARP uses ip_route_input to consult the local routing table for other reasons. In this case, skb would be an ingress ARP request.

saddr

daddr

Source and destination addresses to use for the lookup.

tos

TOS field, a field of the IP header.

dev

Device the packet was received from.

ip_route_input selects the bucket of the hash table that should contain the route, based on the input criteria. It then browses the list of routes in that bucket one by one, comparing all the necessary fields until it either finds a match or gets to the end without a match.

The lookup fields passed as input to ip_route_input are compared to the fields stored in the fl field[*] of the routing cache entry's rtable, as shown in the following code extract. The bucket (hash variable) is chosen through a combination of input parameters. The route itself is represented by the rth variable.

    hash = rt_hash_code(daddr, saddr ^ (iif << 5), tos);
    rcu_read_lock();
    for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
         rth = rcu_dereference(rth->u.rt_next)) {
        if (rth->fl.fl4_dst == daddr &&
            rth->fl.fl4_src == saddr &&
            rth->fl.iif == iif &&
            rth->fl.oif == 0 &&
#ifdef CONFIG_IP_ROUTE_FWMARK
            rth->fl.fl4_fwmark == skb->nfmark &&
#endif
            rth->fl.fl4_tos == tos) {
            rth->u.dst.lastuse = jiffies;
            dst_hold(&rth->u.dst);
            rth->u.dst.__use++;
            RT_CACHE_STAT_INC(in_hit);
            rcu_read_unlock();
            skb->dst = (struct dst_entry*)rth;
            return 0;
        }
        RT_CACHE_STAT_INC(in_hlist_search);
    }
    rcu_read_unlock();

In the case of a cache miss for a destination address that is multicast, the packet is passed to the multicast handler ip_route_input_mc if one of the following two conditions is met, and is dropped otherwise:

  • The destination address is a locally configured multicast address. This is checked with ip_check_mc.

  • The destination address is not locally configured, but the kernel is compiled with support for multicast routing (CONFIG_IP_MROUTE).

This decision is shown in the following code:

    if (MULTICAST(daddr)) {
        struct in_device *in_dev;

        rcu_read_lock();
        if ((in_dev = __in_dev_get(dev)) != NULL) {
            int our = ip_check_mc(in_dev, daddr, saddr,
                             skb->nh.iph->protocol);
            if (our
#ifdef CONFIG_IP_MROUTE
                || (!LOCAL_MCAST(daddr) && IN_DEV_MFORWARD(in_dev))
#endif
                ) {
                rcu_read_unlock();
                return ip_route_input_mc(skb, daddr, saddr,
                             tos, dev, our);
            }
        }
        rcu_read_unlock();
        return -EINVAL;
    }

Finally, in the case of a cache miss for a destination address that is not multicast, ip_route_input calls ip_route_input_slow, which consults the routing table:

    return ip_route_input_slow(skb, daddr, saddr, tos, dev);
}

Egress lookup

__ip_route_output_key is used to route locally generated packets and is very similar to ip_route_input: it checks the cache first and relies on ip_route_output_slow in the case of a cache miss. When the cache supports Multipath, a cache hit requires some more work: more than one entry in the cache may be eligible for selection, and the right one has to be selected based on the caching algorithm in use. The selection is done with multipath_select_route. More details can be found in the section "Multipath Caching."

Here is its prototype and the meaning of its input parameters:

int __ip_route_output_key(struct rtable **rp, const struct flowi *flp)
rp

When the routine returns success, *rp is initialized to point to the cache entry that matched the search key flp.

flp

Search key.

A successful egress cache lookup needs to match the RTO_ONLINK flag, if it is set:

            !((rth->fl.fl4_tos ^ flp->fl4_tos) &
                (IPTOS_RT_MASK | RTO_ONLINK)))

The preceding condition is true when both of the following conditions are met:

  • The TOS of the routing cache entry matches the one in the search key. Note that the TOS field is saved in the bits 2, 3, 4 and 5 of the 8-bit tos variable (as shown in Figure 18-3 in Chapter 18).[*]

  • The RTO_ONLINK flag is set on both the routing cache entry and the search key or on neither of them.

You will see the RTO_ONLINK flag in the section "Search Key Initialization" in Chapter 35. The flag is passed via the TOS variable, but it has nothing to do with the IP header's TOS field; it simply uses an unused bit of the TOS field (see Figure 18-1 in Chapter 18). When the flag is set, it means the destination is located in a local subnet and there is no need to do a routing lookup (or, in other words, a routing lookup could fail but that would not be a problem). This is not a flag the administrator sets when configuring routes, but it is used when doing routing lookups to specify that the route type searched must have scope RT_SCOPE_LINK, which means the destination is directly connected. The flag is then saved in the associated routing cache entries when they are created. Lookups with the RTO_ONLINK flag set are made, for example, by the following protocols:

ARP

When an administrator manually configures an ARP mapping, the kernel makes sure that the IP address belongs to one of the locally configured subnets. For example, the command arp -s 10.0.0.1 11:22:33:44:55:66 adds the mapping of 10.0.0.1 to 11:22:33:44:55:66 to the ARP cache. This command would be rejected by the kernel if, according to its routing table, the IP address 10.0.0.1 did not belong to one of the locally configured subnets (see arp_req_set and Chapter 26).

Raw IP and UDP

When sending data over a socket, the user can set the MSG_DONTROUTE flag. This flag is used when an application is transmitting a packet out from a known interface to a destination that is directly connected (there is no need for a gateway), so the kernel does not have to determine the egress device. This kind of transmission is used, for instance, by routing protocols and diagnostic applications.

Multipath Caching

The concepts behind this feature are introduced in the section "Cache Support for Multipath" in Chapter 31. When the kernel is compiled with support for multipath caching, the lookup code adds multiple routes to the cache, as shown in the section "Multipath Caching" in Chapter 35. In this section, we will examine the key routines used to implement this feature, and the interface provided by caching algorithms.

Registering a Caching Algorithm

Caching algorithms are defined with an instance of the ip_mp_alg_ops data structure, which consists of function pointers. Depending on the needs of the caching algorithm, not all function pointers may be initialized, but one is mandatory: mp_alg_select_route.

Algorithms register and unregister with the kernel, respectively, using multipath_alg_register and multipath_alg_unregister. All the algorithms are implemented as modules in the net/ipv4/ directory.

Interface Between the Routing Cache and Multipath

For each function pointer of the ip_mp_alg_ops data structure, the kernel defines a wrapper in include/net/ip_mp_alg.h. Here is when each one is called:

multipath_select_route

This is the most important routine. It selects the right route from the ones in the cache that satisfy a given lookup (because they are associated with the same multipath route). This routine is called by __ip_route_output_key, the lookup function we saw earlier.

multipath_flush

Clears any state kept by the algorithm when the cache is flushed. It is called by rt_cache_flush (see the section "Flushing the Routing Cache").

multipath_set_nhinfo

Updates the state information kept by the algorithm when a new multipath route is cached.

multipath_remove

Removes the right routes in the cache when a multipath route is removed (for example, by rt_free).

None of the algorithms supports multipath_remove, and only the weighted random algorithm uses multipath_flush and multipath_set_nhinfo.

In later sections, we will see what state information the various algorithms need to keep, and how they implement the mp_alg_select_route routine.

Helper Routines

Here are a couple of routines used by the multipath code:

multipath_comparekeys

Compares two route selectors. It is used mainly by the mp_alg_select_route algorithm's functions to find cached routes that are associated with the same multipath route as another cached route.

rt_remove_balanced_routes

Given an input cached route, removes it and all the other cached routes on the same hash table's bucket that are associated with the same multipath route. The last input parameter to rt_remove_balanced_routes returns the number of cached routes removed. The function's return value is the next rtable instance in the hash bucket's list that follows the input parameter's route. This return value is used by the caller to resume its scan on the table from the right position. When rt_remove_balanced_routes removes the last rtable instance of the bucket's list, it returns NULL.

Common Elements Between Algorithms

Keeping the following three points in mind will help you understand the code that deals with multipath caching, and in particular, the implementation of the mp_alg_select_route routine provided by the caching algorithms:

  • Entries of the routing cache associated with multipath routes can be recognized thanks to the DST_BALANCED flag, which is set prior to their insertion into the cache (see the section "dst_entry Structure" in Chapter 36). We will see exactly how and when this is done in Chapter 35. This flag is often used in the routing cache code to apply different actions, depending on whether a given entry of the cache is associated with a multipath route.

  • The dst_entry structure used to define cached routes includes a timestamp of last use (dst->lastuse). Each time a cached route is returned by a cache lookup, this timestamp is updated for the route. Cache entries associated with multipath routes need to be handled specially. When the cache entry returned by a lookup is associated with a multipath route, all the other entries of the cache associated with the same multipath route must have their timestamps updated, too. This is necessary to avoid having routes purged by the garbage collection algorithm.

  • The input to the mp_alg_select_route routine is the first cache entry that matches the lookup key. Given how elements are added to the routing table cache, all the other entries of the cache associated with the same multipath route are located within the same bucket. For this reason, mp_alg_select_route will browse the bucket list starting from the input cache element and identify the other routes thanks to the DST_BALANCED flag and the multipath_comparekeys routine.

Random Algorithm

This algorithm does not need to keep any state information, and therefore it does not need any memory to be allocated, nor does it take up significant CPU time to make its decisions. All the algorithm does is browse the routes of the input table's bucket, count the number of routes eligible for selection, generate a random number with the local routine random, and select the right cache entry based on that random number.

The algorithm is defined in net/ipv4/multipath_random.c.

Weighted Random Algorithm

This is the algorithm with the most complicated implementation. Each next hop of a multipath route can be assigned a weight. The algorithm selects the right next hop (i.e., the right route in the cache) randomly and proportionally to the weights.

For each multipath route's next hop there is an instance of the fib_nh data structure that stores the weight, among other parameters. We will see in Chapter 34 where those data structures are located in the routing table. In particular, you can refer to Figure 34-1 in that chapter.

The section "Weighted Random Algorithm" in Chapter 31 explains the basic concepts behind this algorithm. To help make a quick decision, the algorithm builds a local database of information that it uses to access fib_nh instances and to read the weights of the next hops. Figure 33-4 shows what that database would look like after configuration of the following two multipath routes:

# ip route add 10.0.1.0/24 mpath wrandom nexthop via 192.168.1.1 weight 1
                                         nexthop via 192.168.2.1 weight 2
# ip route add 10.0.2.0/24 mpath wrandom nexthop via 192.168.1.1 weight 5
                                         nexthop via 192.168.2.1 weight 1

The database is actually not built right away when the multipath routes are defined: it is populated at lookup time.

Remember that the input to the mp_alg_select_route routine (wrandom_select_route in this case) is the first cached route of the routing cache that matches the search key. All other eligible cached routes will be in the same routing cache bucket.

Selection of the route by mp_alg_select_route is accomplished in two steps:

  1. mp_alg_select_route first browses the routing cache's bucket, and for each route, checks whether it is eligible for selection with the multipath_comparekeys routine. In the meantime, it creates a local list of eligible cached routes, with the main goal of defining a line like the one in Figure 31-4 in Chapter 31. Figure 33-5 shows what the list would look like for the example in that chapter. Each route added to the list gets its weight using the database in Figure 33-4 and initializes the power field accordingly.

    Figure 33-4. Next-hop database created by the weighted random algorithm

    Figure 33-5. Example of temporary list created for the next-hop selection

  2. mp_alg_select_route generates a random number and, given the list of eligible routes, selects one route using the mechanism described in the section "Weighted Random Algorithm" in Chapter 31.

Let's see how a lookup on the state database works. Let's keep in mind that cached routes (that is, rtable instances) contain the next hop router and the egress device. Given a cached route, __multipath_lookup_weight first selects the right state's bucket based on the egress device: state is indexed based on that device. Once a bucket of state has been selected, the list of multipath_route elements is scanned, looking for one that matches the gateway and device fields. Once the right multipath_route instance has been identified, the list of associated multipath_dest structures is scanned, looking for one that matches the destination IP address of the input lookup key fl. From the matching multipath_dest instance, the function can read the next-hop weight via the pointer nh_info that points to the right fib_nh instance.

The state database is populated by the multipath_set_nhinfo routine we saw in the section "Interface Between the Routing Cache and Multipath."

This algorithm is defined in net/ipv4/multipath_wrandom.c.

Round-Robin Algorithm

The round-robin algorithm does not need additional data structures to keep the state information it needs. All the required information is retrieved from the dst->__use field of the dst_entry structure, which represents the number of times a cache lookup returned the route. The selection of the right route therefore consists simply of browsing the routes of the input table's bucket, and selecting, among the eligible routes, the one with the lowest value of __use.

The algorithm is defined in net/ipv4/multipath_rr.c.

Device Round-Robin Algorithm

The purpose and effect of this algorithm were explained in the section "Device Round-Robin Algorithm" in Chapter 31. This algorithm selects the right egress device, and therefore the right entry in the cache for a given multipath route, with the drr_select_route routine as follows:

  1. The global vector state keeps a counter for each device that indicates how many times it has been selected.

  2. For each multipath route, only the first next hop on any given device is considered. This speeds up the decision but implies that there is no load sharing between next hops that share the same egress device: for each device, only one next hop of any multipath route is used.

  3. While browsing the routes (i.e., next hops) for the computation of the lowest use count, routes associated with devices that have not been used yet are given higher preference. When a new device is selected, a new entry is added to state.

  4. The first route analyzed for the device with the lowest use count is selected.

The algorithm is defined in net/ipv4/multipath_drr.c.

Interface Between the DST and Calling Protocols

The DST cache is an independent subsystem; it has, for instance, its own garbage collection mechanism. As a subsystem, it provides a set of functions that various protocols can use to change or tune its behavior. When external subsystems need to interact with the routing cache, such as to notify it of an event or read the value of one of its parameters, they do it via a set of DST routines defined in the files net/core/dst.c and include/net/dst.h. These routines are wrappers around a set of functions made available by the L3 protocol that owns the cache, by initializing an instance of a dst_ops VFT, as shown in Figure 33-6.

Figure 33-6. dst_ops interface

The key structure presented by DST to higher layers is dst_entry; protocol-specific structures such as rtable are merely wrappers for this structure. IP owns the routing cache, but other protocols often keep references to routing cache elements. All of those references refer to dst_entry, not to its rtable wrapper. The sk_buff buffers also keep a reference to the dst_entry structure, not to the rtable structure. This reference is used to store the result of the routing lookup.

The dst_entry and dst_ops structures are described in detail in the associated sections in Chapter 36. There is an instance of dst_ops for each protocol; for example, IPv4 uses ipv4_dst_ops, initialized in net/ipv4/route.c:

struct dst_ops ipv4_dst_ops = {
    .family =          AF_INET,
    .protocol =        __constant_htons(ETH_P_IP),
    .gc =              rt_garbage_collect,
    .check =           ipv4_dst_check,
    .destroy =         ipv4_dst_destroy,
    .ifdown =          ipv4_dst_ifdown,
    .negative_advice = ipv4_negative_advice,
    .link_failure =    ipv4_link_failure,
    .update_pmtu =     ip_rt_update_pmtu,
    .entry_size =      sizeof(struct rtable),
};

Whenever the DST subsystem is notified of an event or a request is made via one of the DST interface routines, the protocol associated with the affected dst_entry instance is notified by an invocation of the proper function among the ones provided by the dst_entry through its instance of the dst_ops VFT. For example, if ARP would like to notify the upper protocol about the unreachability of a given IPv4 address, it calls dst_link_failure for the associated dst_entry structure (remember that cached routes are associated with IP addresses, not with networks), which will invoke the ipv4_link_failure routine registered by IPv4 via ipv4_dst_ops.

It is also possible for the calling protocol to intervene directly in DST's behavior. For example, when IPv4 asks DST to allocate a new cache entry, DST may then realize there is a need to start garbage collection and invoke rt_garbage_collect, the routine provided by IPv4 itself.

When a given type of notification requires some kind of processing common to all the protocols, the common logic may be implemented directly inside the DST APIs instead of being replicated in each protocol's handler.

Some virtual functions in the DST's dst_ops structure are invoked through wrappers in higher layers; functions that do not have a wrapper are invoked directly through the syntax dst->ops->function. Here is the meaning of the dst_ops virtual functions and a brief description of the IPv4 subsystem's routines (listed in the preceding code snapshot) that would be assigned to them:

gc

Takes care of garbage collection. It is run when the subsystem allocates a new cache entry with dst_alloc and that function realizes there is a shortage of memory. The IPv4 routine rt_garbage_collect is described in the section "Synchronous Cleanup."

check

A cached route whose dst_entry is marked as dead is normally not usable. However, there is one case, when IPsec is in use, where that is not necessarily true. This routine is used to check whether an obsolete dst_entry is usable. For instance, look at the ipv4_dst_check routine, which performs no check on the submitted dst_entry structure, and compare it to the corresponding xfrm_dst_check routine used to do "xfrm" transforms for IPsec. Also see how routines such as sk_dst_check (introduced in Chapter 21) check the status of a cached route. There is no wrapper for this function.

destroy

Called by dst_destroy, the routine that the DST runs to delete a dst_entry structure, and informs the calling protocol of the deletion to give it a chance to do any necessary cleanup first. For example, the IPv4 routine ipv4_dst_destroy uses the notification to release references to other data structures. dst_destroy is described in the section "Deleting DST Entries."

ifdown

Called by dst_ifdown, which is invoked by the DST subsystem itself when a device is shut down or unregistered. It is called once for each affected cached route (see the section "External Events"). The IPv4 routine ipv4_dst_ifdown replaces the rtable's pointer to the device's IP configuration idev with a pointer to the loopback device, because that is always sure to exist.

negative_advice

Called by the DST function dst_negative_advice, which is used to notify the DST about a problem with a dst_entry instance. For example, TCP uses dst_negative_advice when it detects a write timeout.

The IPv4's routine ipv4_negative_advice uses this notification to delete the cached route. When the dst_entry is already marked as dead (through its dst->obsolete flag, as we will see in the section "Deleting DST Entries"), ipv4_negative_advice simply releases the rtable's reference to the dst_entry.

link_failure

Called by the DST function dst_link_failure, which is invoked when a transmission problem is detected due to an unreachable destination.

As an example of this function's use, the neighbor protocols ARP and Neighbor Discovery—used by IPv4 and IPv6, respectively—invoke it to indicate that they never received a reply to solicitation requests they generated to resolve an L3-to-L2 address association. (They can usually tell this because of a timeout; see, for example, arp_error_report in net/ipv4/arp.c for the behavior of the ARP protocol.) Other higher-layer protocols, such as the various tunnels (IP over IP, etc.), do the same when they have problems reaching the other end of a tunnel, which could be several hops away; see, for example, ipip_tunnel_xmit in net/ipv4/ipip.c for the IP-over-IP tunneling protocol.

update_pmtu

Updates the PMTU of a cached route. It is usually invoked to handle the reception of an ICMP Fragmentation Needed message. See the section "Processing Ingress ICMP_REDIRECT Messages" in Chapter 31. There is no wrapper for this function.

get_mss

Returns the TCP maximum segment size that can be used on this route. IPv4 does not initialize this routine, and there is no wrapper for this function. See the section "IPsec Transformations and the Use of dst_entry."

Besides the wrappers around the functions just shown, the DST also manipulates dst_entry instances through functions that do not need to interact with other subsystems. For example, the section "Asynchronous Cleanup" shows dst_set_expires, and Chapter 26 shows how dst_confirm is used to confirm the reachability of a neighbor. See the files net/core/dst.c and include/net/dst.h for more details.

IPsec Transformations and the Use of dst_entry

In the previous sections, we saw the most common use for dst_entry structures: to store the protocol-independent information regarding a cached route, including the input and output methods that process the packets to be received or transmitted after a routing lookup.

Another use for dst_entry structures is made by IPsec, a suite of protocols used to provide secure services such as authentication and confidentiality on top of IP. IPsec uses dst_entry structures to build what it calls transformation bundles . A transformation is an operation to apply to a packet, such as encryption. A bundle is just a set of transformations defined as a sequence of operations. Once the IPsec protocols decide on all the transformations to apply to the traffic that matches a given route, that information is stored in the routing cache as a list of dst_entry structures.

Normally, a route is associated with a single dst_entry structure whose input and output fields describe how to process the matching packets (forward, deliver locally, etc., as shown in Figure 18-1 in Chapter 18). But IPsec creates a list of dst_entry instances where only the last instance uses input and output to actually apply the routing decisions; the previous instances use input and output to apply the required transformations, as shown in Figure 33-7 (the model in the figure is a simplified one).

Figure 33-7. Use of dst_entry (a) without IPsec; (b) with IPsec

dst_entry lists are created using the child pointer in the structure. Another pointer named path, also used by IPsec, points to the last element of the list (the one that would be created even when IPsec is not in use).
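A minimal sketch of such a list shows how child chains a bundle together while path shortcuts to the final element. The toy_ types below are hypothetical stand-ins, not the real dst_entry:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal sketch of an IPsec-style bundle: each transformation entry
   points to the next via 'child', and every entry's 'path' points at
   the final entry holding the real routing information. */
struct toy_dst {
    struct toy_dst *child;   /* next element of the bundle, NULL at the end */
    struct toy_dst *path;    /* last element: the real route */
};

/* Fix up 'path' on every element once the chain has been built. */
static void toy_bundle_set_paths(struct toy_dst *head)
{
    struct toy_dst *last = head;
    while (last->child)
        last = last->child;
    for (struct toy_dst *d = head; d; d = d->child)
        d->path = last;
}
```

With two transformation entries chained in front of a real route, every element's path ends up pointing at the real route, including the real route itself.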

Each of the other dst_entry elements in the list—that is, each element except the last—is there to implement an IPsec transformation. Each sets its path field to point to the last element. In addition, each sets its DST_NOHASH flag so that the DST subsystem knows it is not part of the routing cache hash table and that another subsystem is taking care of it.

The implications of IPsec on routing lookups are as follows: both input and output routing lookups are affected by the data structure layout shown for IPsec configuration in Figure 33-7(b). The result returned by a lookup is a pointer to the first dst_entry that implements a transformation, not the last one representing the real routing information. This is because the first dst_entry instance represents the first transformation to be applied, and the transformations must be applied in order.

You can find interactions between the IP or routing layer and IPsec in several other places:

  • For egress traffic, ip_route_output_flow (which is called by ip_route_output_key, introduced in the section "Cache Lookup") includes extra code (i.e., a call to xfrm_lookup) to interact with IPsec.

  • For ingress traffic that is to be delivered locally, ip_local_deliver_finish calls xfrm4_policy_check to consult the IPsec policy database.

  • ip_forward makes the same check for ingress traffic that needs to be forwarded.

Sometimes the IP code makes a direct call to the generic xfrm_xxx IPsec routines, and sometimes it uses IPv4 wrappers with the names xfrm4_xxx.

External Events

When dst_init initializes the DST subsystem, it registers with the device event notification chain netdev_chain, introduced in Chapter 4. The only two events the DST is interested in are the ones generated when a network device goes down (NETDEV_DOWN) and when a device is unregistered (NETDEV_UNREGISTER). You can find the complete list of NETDEV_XXX events in include/linux/notifier.h.

When a device becomes unusable, either because it is not available anymore (for instance, it has been unregistered from the kernel), or because it has simply been shut down for administrative reasons, all the routes using that device become unusable as well. This means that both the routing tables and the routing cache need to be notified about this kind of event and react accordingly. We will see how the routing tables are handled in Chapter 34. Here we will see how the routing cache is cleaned up. The dst_entry structures for cached routes can be inserted in one of two places:

  • The routing cache.

  • The dst_garbage_list list. Here deleted routes wait for all their references to be released, to become eligible for deletion by the garbage collection process.

The entries in the cache are taken care of by the notification handler fib_netdev_event (described in the section "Impacts on the routing tables" in Chapter 32), which, among other actions, flushes the cache. The ones in the dst_garbage_list list are taken care of by the routine that DST registers with the netdev_chain notification chain. As shown in the following snippet from net/core/dst.c, the handler DST uses to process the received notifications is dst_dev_event:

static struct notifier_block dst_dev_notifier = {
    .notifier_call = dst_dev_event,
};
 
void _ _init dst_init(void)
{
    register_netdevice_notifier(&dst_dev_notifier);
}

dst_dev_event browses the dst_garbage_list list of dead dst_entry structures and invokes dst_ifdown for each one. The last input parameter to dst_ifdown tells it what event it is being called to handle. Here is how it handles the two event types:

NETDEV_UNREGISTER

When the device is unregistered, all references to it have to be removed. dst_ifdown replaces them with references to the loopback device, for both the dst_entry structure and its associated neighbour instance, if any.[*]

NETDEV_DOWN

Because the device is down, traffic cannot be sent to it anymore. Therefore, the input and output routines of dst_entry are set to dst_discard_in and dst_discard_out, respectively. These two routines simply discard any input buffer passed to them (i.e., any frame they are asked to process).

We saw in the section "IPsec Transformations and the Use of dst_entry" that a dst_entry structure could be linked to other ones through the child pointer. dst_ifdown goes child by child and updates all of them. The input and output routines are updated only for the last entry, because that entry is the one that uses the routines for reception or transmission.
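That walk can be sketched as follows, again with hypothetical toy_ types: every element of the chain is visited, but only the last one has its handler replaced with a "discard" routine, since it alone performs reception or transmission.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for a dst_entry bundle element. */
struct toy_dst {
    struct toy_dst *child;   /* next element; NULL marks the last entry */
    int (*output)(void);     /* transmission handler */
};

/* Stand-in for a dst_discard-style routine: drop everything. */
static int toy_discard(void) { return -1; }

/* Sketch of the dst_ifdown walk: visit each element child by child,
   but swap in the discard handler only on the last entry. */
static void toy_ifdown(struct toy_dst *dst)
{
    for (; dst; dst = dst->child) {
        if (!dst->child)                 /* last entry of the chain */
            dst->output = toy_discard;
    }
}
```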

We saw in Chapter 8 that unregistering a device triggers not only a NETDEV_UNREGISTER notification but also a NETDEV_DOWN notification, because a device has to be shut down to be unregistered. This means that both events handled by dst_dev_event occur when a device is unregistered. This explains why dst_ifdown checks its unregister parameter and deliberately skips part of its code when the parameter is set, while running other parts only when it is set.

Flushing the Routing Cache

Whenever a change in the system takes place that could cause some of the information in the cache to become out of date, the kernel flushes the routing cache. In many cases, only selected entries are out of date, but to keep things simple the kernel removes all entries. The main events that trigger flushing are:

A device comes up or goes down

Some addresses that used to be reachable through a given device may not be reachable anymore, or may be reachable through a different device with a better route.

An IP address is added to or removed from a device

We saw in the sections "Adding an IP address" and "Removing an IP address" in Chapter 32 that Linux creates a special route for each locally configured IP address. When an address is removed, any associated route in the cache also has to be removed. The removed address was most likely configured with a netmask different from /32, so all the cache entries associated with addresses within the same subnet should go away[*] as well. Finally, if one of the addresses in the same subnet was used as a gateway for other indirect routes, all of them should go away. Flushing the entire cache is simpler than keeping track of all of these possible cases.

The global forwarding status, or the forwarding status of a device, has changed

If you disable forwarding, you need to remove all the cached routes that were used to forward traffic. See the section "Enabling and Disabling Forwarding" in Chapter 36.

A route is removed

All the cached entries associated with the deleted route need to be removed.

An administrative flush is requested via the /proc interface

This is described in the section "The /proc/sys/net/ipv4/route Directory" in Chapter 36.

The routine used to flush the cache is rt_run_flush, but it is never called directly. Requests to flush the cache are done via rt_cache_flush, which will either flush the cache right away or start a timer, depending on the value of the input timeout provided by the caller:

Less than 0

The cache is flushed after the number of seconds specified by the kernel parameter ip_rt_min_delay, which can be tuned via /proc as described in the section "The /proc/sys/net/ipv4/route Directory" in Chapter 36.

0

The cache is flushed right away.

Greater than 0

The cache is flushed after the specified amount of time.
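The three cases reduce to a small piece of selection logic, sketched below. The toy_ names are illustrative; in the kernel, the negative-delay case reads the ip_rt_min_delay tunable.

```c
#include <assert.h>

/* Stand-in for the ip_rt_min_delay kernel parameter (seconds),
   tunable via /proc in the real kernel. */
static int toy_ip_rt_min_delay = 2;

/* Sketch of the delay selection performed by rt_cache_flush:
   negative means "use the configured minimum delay", zero means
   "flush right away", positive means "flush after that many seconds". */
static int toy_flush_delay(int requested_delay)
{
    if (requested_delay < 0)
        return toy_ip_rt_min_delay;
    return requested_delay;     /* 0 (immediate) or an explicit delay */
}
```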

Once a flush request is submitted, a flush is guaranteed to take place within ip_rt_max_delay seconds, which is set to 8 by default. When a flush request is submitted and there is already one pending, the timer is restarted to reflect the new request; however, the new request cannot ask the timer to expire later than ip_rt_max_delay seconds since the previous timer was fired. This is accomplished by using the global variable rt_deadline.

In addition, the cache is periodically flushed by means of a periodic timer, rt_secret_timer, that expires every ip_rt_secret_interval seconds (see the section "The /proc/sys/net/ipv4/route Directory" in Chapter 36 for its default value). When the timer expires, the handler rt_secret_rebuild flushes the cache and restarts the timer. ip_rt_secret_interval is configurable via /proc.

Garbage Collection

As explained in the section "Routing Cache Garbage Collection" in Chapter 30, there are two kinds of garbage collection:

  • To free memory when a shortage is detected. This is actually split into two tasks, one synchronous and one asynchronous. The synchronous task is triggered at irregular times by particular conditions, and the asynchronous task runs more or less regularly at the expiration of a timer.

  • To clean up dst_entry structures that the kernel asked to be removed, but that could not be deleted right away because someone still held a reference to them.

This section covers both the synchronous and asynchronous cases of the first type of garbage collection. The section "Deleting DST Entries" goes into detail on the other type.

Both synchronous and asynchronous garbage collection use a common routine to decide whether a given dst_entry instance is eligible for deletion: rt_may_expire. The routine accepts two parameters (tmo1, tmo2) that represent the minimum time that candidates must have spent in the cache before being eligible for deletion. Specifically, tmo2 applies to those candidates that are considered particularly good for deletion, and tmo1 applies to all the other candidates, as described in the section "Examples of eligible cache victims" in Chapter 30. The ip_rt_gc_timeout parameter specifies the time for other entries in the cache.

The lower those two values are, the more likely it is that entries will be deleted. That's why, as shown in the section "Asynchronous Cleanup," rt_check_expire halves the local variable tmo every time an entry is not removed. As we will see in the section "rt_garbage_collect Function," rt_garbage_collect does the same with both thresholds.
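The "halve the threshold on every miss" idea can be illustrated with a small sketch. Each entry carries its age in the cache; an entry is expired when its age exceeds the current threshold, and every miss halves the threshold so that later entries in the same bucket become easier to expire. The function below is an illustrative simplification, not the kernel's rt_may_expire.

```c
#include <assert.h>

/* Scan one bucket's entries: expire those older than 'tmo', and halve
   'tmo' on every entry that survives. Returns the number expired. */
static int toy_scan_bucket(const int *ages, int n, int tmo)
{
    int expired = 0;
    for (int i = 0; i < n; i++) {
        if (ages[i] > tmo) {
            expired++;
        } else {
            tmo >>= 1;       /* not expired: relax the bar for the rest */
        }
    }
    return expired;
}
```

With ages {1, 5, 5} and an initial threshold of 8, the first entry survives and halves the threshold to 4, which then makes both remaining entries eligible.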

Synchronous Cleanup

A synchronous cleanup is triggered when the DST subsystem detects a shortage of memory. While it is up to the DST to decide when to trigger garbage collection, the routine that takes care of it is provided by the protocol that owns the cache. Everything is controlled through the dst_ops virtual functions introduced in the section "Interface Between the DST and Calling Protocols." We saw there that dst_ops has a function called gc, which IPv4 initializes to rt_garbage_collect. gc is invoked in the following two cases:

  • When a new entry is added to the routing cache and a memory shortage comes up. When adding an entry, rt_intern_hash has to bind the route to the neighbour data structure associated with the next hop (see the section "Binding the Route Cache to the ARP Cache"). If there is not enough memory to allocate a new neighbour data structure, the routing cache is scanned in an attempt to free some memory. This is done because there could be some cache entries that have not been used for a while, and removing them could allow the associated neighbour entries to be removed, too. (I said "could" allow it, because as we know, a data structure cannot be removed until all the references to it have been removed.)

  • When a new entry is added to the routing cache and the total number of entries exceeds the threshold gc_thresh. The dst_alloc function that allocates the entry triggers a cleanup to keep down memory use by restricting the cache to a fixed size. gc_thresh is configurable via /proc (see the section "Tuning via /proc Filesystem" in Chapter 36).

The next section gives the internals of rt_garbage_collect.

rt_garbage_collect Function

The logic of rt_garbage_collect is described in Figures 33-8(a) and 33-8(b).

The garbage collection done by the rt_garbage_collect routine is expensive in terms of CPU time. Therefore, the routine returns without doing anything if less than ip_rt_gc_min_interval seconds have passed since the last invocation, unless the number of entries in the cache reached the maximum value ip_rt_max_size, which requires immediate attention.
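That rate limiting amounts to the following check, sketched here with illustrative names and seconds as the time unit (the kernel works in jiffies):

```c
#include <assert.h>

/* Sketch of rt_garbage_collect's entry guard: skip the expensive
   collection if it ran recently, unless the cache has reached its
   hard limit, which requires immediate attention. */
static int toy_should_collect(long now, long last_run, long min_interval,
                              int entries, int max_size)
{
    if (entries >= max_size)
        return 1;                        /* hard limit: act immediately */
    return (now - last_run) >= min_interval;
}
```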

Figure 33-8a. rt_garbage_collect function

ip_rt_max_size is a hard limit. Once that threshold is reached, dst_alloc fails until rt_garbage_collect manages to free some memory.

Here is the logical structure of rt_garbage_collect:

Figure 33-8b. rt_garbage_collect function

  • First it computes the number of cache entries it would like to remove (goal). From this value and the number of entries currently in the cache (ipv4_dst_ops.entries), it derives the number of entries that would be left once goal entries are removed, and stores this number in equilibrium.

  • It browses the hash table and tries to expire the most-eligible entries, checking their eligibility with rt_may_expire. Entries eligible for deletion are deleted with rt_free directly or with rt_remove_balanced_route, depending on whether they are associated with multipath routes (see the section "Helper Routines").

  • Once the table has been scanned completely, it checks whether the goal has been met, and if not, it repeats the loop with more-aggressive eligibility criteria.

The number of entries to remove (goal) depends on how heavily loaded the hash table is. The goal is to expire entries faster when the table is more heavily loaded.

With the help of Figure 33-9, let's clarify some of the thresholds used by rt_garbage_collect to define goal:

  • The size of the hash table is rt_hash_mask+1, or 2rt_hash_log. rt_garbage_collect is called when the number of entries in the cache is bigger than gc_thresh, whose default value is the size of the hash table.

  • The maximum number of entries that the cache can hold is ip_rt_max_size, which by default is set to 16 times the size of the hash table.

  • When the number of entries in the cache is bigger than ip_rt_gc_elasticity*(2rt_hash_log), which by default is eight times the size of the hash table, the cache is considered to be dangerously large and the garbage collection starts setting goal more aggressively.
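The default relationships among these thresholds can be written down as a sketch. With a hash table of 2^rt_hash_log buckets, gc_thresh defaults to the table size, the "dangerously large" mark is ip_rt_gc_elasticity (8) times the table size, and ip_rt_max_size is 16 times the table size. The toy_ names below are illustrative.

```c
#include <assert.h>

struct toy_gc_thresholds {
    int gc_thresh;    /* start garbage collection above this */
    int danger;       /* ip_rt_gc_elasticity * table size: be aggressive */
    int max_size;     /* ip_rt_max_size: hard limit */
};

/* Compute the default threshold values for a table of 2^rt_hash_log
   buckets, following the relationships described in the text. */
static struct toy_gc_thresholds toy_default_thresholds(int rt_hash_log)
{
    int table_size = 1 << rt_hash_log;
    struct toy_gc_thresholds t = {
        .gc_thresh = table_size,
        .danger    = 8 * table_size,
        .max_size  = 16 * table_size,
    };
    return t;
}
```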

Figure 33-9. Garbage collection thresholds

Once the thresholds have been defined, rt_garbage_collect browses the hash table elements looking for victims. The table is not simply browsed from the first to the last bucket. rt_garbage_collect keeps a static variable, rover, that remembers the last bucket that was scanned at the previous invocation. This is because the table does not necessarily need to be scanned completely. By remembering the last scanned bucket, the routine handles all the buckets fairly, instead of always selecting victims from the first buckets. Victims are identified by rt_may_expire. This routine, already described in the section "Garbage Collection," is passed two time thresholds that define how two categories of entries should be considered eligible for deletion. While scanning elements of a bucket, one of the thresholds is lowered (halved) every time an element is not selected. At the end of each bucket's list, the function checks again whether the number of deleted entries meets the goal set at the beginning of the function (goal). If not, the function goes ahead with the next bucket. This continues until the whole table has been scanned. At that point, the function lowers the value of the second time threshold passed to rt_may_expire, to make it even more likely to find eligible victims. Then a new scan over the table starts, if it would not be too time consuming. The new scan is considered too time consuming and is skipped if the routine was called in software interrupt context, or if the previous scan took more than one jiffy of time (e.g., 1/1000 of a second on an x86 platform).

Asynchronous Cleanup

Synchronous garbage collection is used to handle specific cases of memory shortage; but it would be better to avoid waiting for extreme conditions to emerge before taking action: in other words, it is better to make extreme conditions less likely. This is what the asynchronous cleanup does by means of a periodic timer.

The timer, rt_periodic_timer, is started by ip_rt_init when the routing subsystem is initialized, and invokes the handler rt_check_expire every time it expires. Each time it is invoked, rt_check_expire scans just a part of the cache. It keeps a static variable (rover) to remember the last bucket it scanned at the previous invocation and starts scanning each time from the next one. rt_check_expire restarts the timer and returns when it has finished scanning the entire table or has run for at least one jiffy.
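The rover technique can be sketched as follows: each invocation scans at most a fixed budget of buckets, starting where the previous call left off, so all buckets are visited fairly across invocations. The names and the budget are illustrative assumptions.

```c
#include <assert.h>

/* Sketch of the partial, round-robin scan performed by rt_check_expire:
   visit up to 'budget' buckets starting at *rover, then remember where
   to resume next time. 'visits' counts scans per bucket for the demo. */
static int toy_scan(int *visits, int table_size, int *rover, int budget)
{
    int scanned = 0;
    while (scanned < budget) {
        visits[*rover]++;
        *rover = (*rover + 1) % table_size;
        scanned++;
    }
    return scanned;
}
```

Two successive calls with a budget of 3 over a 4-bucket table visit buckets 0-2 and then 3, 0, 1, so no bucket is starved.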

Entries are removed with rt_free if their time in the cache has expired, or if they are considered eligible by rt_may_expire. When the entry is associated with a multipath route, the deletion is taken care of by rt_remove_balanced_route.

        while ((rth = *rthp) != NULL) {
            if (rth->u.dst.expires) {
                if (time_before_eq(now, rth->u.dst.expires)) {
                    tmo >>= 1;
                    rthp = &rth->u.rt_next;
                    continue;
                }
            } else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
                 tmo >>= 1;
                 rthp = &rth->u.rt_next;
                 continue;
            }
            /* Cleanup aged off entries. */
#ifdef CONFIG_IP_ROUTE_MULTIPATH_CACHED
            /* remove all related balanced entries if necessary */
            if (rth->u.dst.flags & DST_BALANCED) {
                rthp = rt_remove_balanced_route(
                    &rt_hash_table[i].chain,
                    rth, NULL);
                if (!rthp)
                    break;
            } else {
                *rthp = rth->u.rt_next;
                rt_free(rth);
            }
#else /* CONFIG_IP_ROUTE_MULTIPATH_CACHED */
             *rthp = rth->u.rt_next;
             rt_free(rth);
#endif /* CONFIG_IP_ROUTE_MULTIPATH_CACHED */
        }
        ...
        if (time_after(jiffies, now))
                break;

The timer expires by default every ip_rt_gc_interval seconds, whose value is 60 by default but can be changed via the /proc/sys/net/ipv4/route/gc_interval file (see the section "Tuning via /proc Filesystem" in Chapter 4). The first time the timer fires, it is set to expire after a random number of seconds between ip_rt_gc_interval and 2*ip_rt_gc_interval (see ip_rt_init). The reason for using the random value is to avoid the possibility that timers from different kernel subsystems might expire at the same time and use up the CPU. This is conceivable if many subsystems start up at the same time during the boot process and schedule times at regular intervals.
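
The jitter can be sketched as follows. first_gc_delay is a hypothetical helper written for illustration (the real kernel computes this inline in ip_rt_init with its own random source); C's rand stands in for the kernel's random-number facility:

```c
#include <stdlib.h>

/* Pick the first expiration delay in the range [interval, 2*interval)
 * so that periodic timers started together by different subsystems
 * do not all fire on the same tick. */
static unsigned long first_gc_delay(unsigned long interval)
{
    return interval + ((unsigned long)rand() % interval);
}
```

After this first randomized firing, the timer goes back to expiring every ip_rt_gc_interval seconds.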

Expiration Criteria

By default, routing cache entries never expire because dst_entry->expires is 0.[*] When an event that can expire cache entries occurs (see the section "Examples of events that can expire cache entries" in Chapter 30), entries are expired by setting their dst_entry->expires timestamp field to a nonzero value with dst_set_expires []:

  • When an ICMP UNREACHABLE or FRAGMENTATION NEEDED message is received, the PMTU of all the related routes (those that have the same destination IP as the one specified by the IP header carried in the payload of the ICMP message) must be updated to the MTU specified in the ICMP header. Thus, the ICMP core code calls ip_rt_frag_needed to update the routing cache. The affected entries are set to expire after the configurable time ip_rt_mtu_expires, which by default is 10 minutes and can be changed with /proc/sys/net/route/mtu_expires. See Chapter 25 for more details.

  • When the TCP code updates the MTU of a route with the path MTU discovery algorithm, it calls the ip_rt_update_mtu function, which in turn calls dst_set_expires. Refer to Chapter 18 for more details on path MTU discovery.

  • When a destination IP address is classified as unreachable, the associated dst_entry structure in the cache is marked as unreachable by directly or indirectly calling the link_failure method of the dst_ops data structure (see the section "Interface Between the DST and Calling Protocols").

Deleting DST Entries

In the previous sections, we saw how rtable cache entries are deleted by synchronous or asynchronous cleanups and background garbage collection. In this section, we will see how the embedded dst_entry structures are taken care of. The function that deletes a dst_entry is dst_free.

The reference count on a dst_entry is incremented and decremented with dst_hold and dst_release, respectively. But when dst_release is called to release the last reference, the entry is not deleted automatically. Instead, it is removed indirectly when the associated rtable structures are removed with rt_free and rt_drop. These functions schedule the execution of dst_free via dst_rcu_free, which takes care of the RCU mechanisms (see the section "Cache Locking").

We saw in the section "IPsec Transformations and the Use of dst_entry" that dst_entry structures are not always embedded into rtable structures. Standalone instances are removed by calling dst_free directly.

The removal of a dst_entry is not complex, but there are a couple of points that need to be covered to understand how dst_free and its helper routines work:

  • When an entry cannot be removed because it is still referenced, it is marked as dead by setting its obsolete flag to 2 (the default value for dst->obsolete is 0). An attempt to delete an entry that is already dead fails.

  • As we saw in the section "IPsec Transformations and the Use of dst_entry," a dst_entry instance could have children. When deleting the first dst_entry of a list, the routing subsystem has to delete all the others as well. But at the same time, you need to keep in mind that no entry can be removed as long as some references to it are left.

Given these two points, let's see now how dst_free works.

When dst_free is called to remove an entry whose reference count is 0, it removes the entry right away with dst_destroy. The latter function also tries to remove any children linked to the structure. When one of the children cannot be removed because it is still referenced, dst_destroy returns a pointer to the child so that dst_free can take care of it.

When dst_free is called to remove an entry whose reference count is not 0—which includes the case just described, when dst_destroy could not delete a child—it does the following:

  • Marks the entry as dead by setting its obsolete flag.

  • Replaces the entry's input and output routines with two fake ones, dst_discard_in and dst_discard_out. These ensure that no reception or transmission is attempted on the associated routes (see the description of input and output in the section "dst_entry Structure" in Chapter 36). This initialization is typical of a device that is not yet operative, or in a down state (the flag IFF_UP is not set).

    We saw in the section "External Events" that when the two events handled by dst_dev_event occur, dst_ifdown is called to take care of the dst_entry structures in the dst_garbage_list. In particular, it replaces their current input and output methods with dst_discard_in and dst_discard_out. This is not superfluous, because dst_free does this only when the dst_entry it is called to free is associated with a device being shut down, which is not necessarily always the case when one of the dst_dev_event events occurs.

  • Adds the structure to the global list dst_garbage_list. This list links all entries that should be removed, but cannot be removed yet due to nonzero reference counts.

  • Adjusts the dst_gc_timer timer to expire after the minimum configurable delay (DST_GC_MIN) and fires it if it is not already running.
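
The steps above can be condensed into a small userspace model; the dst_sk type, dst_free_sk, and dst_discard are illustrative stand-ins for the kernel's dst_entry, dst_free, and dst_discard_in/dst_discard_out, not copies of net/core/dst.c:

```c
#include <stddef.h>

struct dst_sk {
    int   refcnt;
    int   obsolete;                  /* 2 == marked dead */
    int (*input)(void);
    int (*output)(void);
    struct dst_sk *garbage_next;     /* link into the garbage list */
};

/* Stand-in for the dst_discard_in/dst_discard_out stubs. */
static int dst_discard(void) { return -1; }

static struct dst_sk *garbage_list;  /* models dst_garbage_list */

static void dst_free_sk(struct dst_sk *d)
{
    if (d->refcnt != 0) {
        d->obsolete = 2;                 /* 1. mark the entry dead      */
        d->input  = dst_discard;         /* 2. neutralize the I/O paths */
        d->output = dst_discard;
        d->garbage_next = garbage_list;  /* 3. queue for the GC timer   */
        garbage_list = d;
        return;
    }
    /* refcnt == 0: dst_destroy would free the entry (and try its
     * children) immediately. */
}
```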

The dst_gc_timer timer periodically browses the dst_garbage_list list and removes, with dst_destroy, entries with a reference count of 0. When the timer handler dst_run_gc cannot remove all the entries in the list, it starts the timer again but makes it expire a little later. To be precise, it adds DST_GC_INC seconds to its expiration delay, up to a maximum delay of DST_GC_MAX. But each time dst_free adds a new element to dst_garbage_list, it resets the expiry delay to the default minimum value DST_GC_MIN.
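
The back-off arithmetic can be sketched with the DST_GC_* values from Table 33-1; HZ is assumed to be 1000 here, and the helper names are invented for this sketch:

```c
#define HZ          1000             /* assumed tick rate */
#define DST_GC_MIN  (HZ / 10)
#define DST_GC_MAX  (120 * HZ)
#define DST_GC_INC  (HZ / 2)

/* Next delay when dst_run_gc failed to empty dst_garbage_list:
 * back off by DST_GC_INC, capped at DST_GC_MAX. */
static unsigned long next_gc_delay(unsigned long cur)
{
    unsigned long next = cur + DST_GC_INC;
    return next > DST_GC_MAX ? DST_GC_MAX : next;
}

/* dst_free resets the delay whenever it queues a new entry. */
static unsigned long reset_gc_delay(void)
{
    return DST_GC_MIN;
}
```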

Figures 33-10(a) and 33-10(b) summarize the logic of dst_free.

Variables That Tune and Control Garbage Collection

In summary, here are the meanings of the main global variables and parameters that control the DST garbage collection task:

dst_garbage_list

The list of dst_entry structures waiting to be removed. When dst_gc_timer expires, the handler takes care of them. Entries are put into this list (instead of being removed directly) only when the reference count __refcnt is greater than 0, preventing their deletion. New entries are inserted at the head of the list.

dst_gc_timer_expires

dst_gc_timer_inc

dst_gc_timer_expires is the number of seconds the timer waits before expiring. Its value ranges between DST_GC_MIN and DST_GC_MAX and is increased with units of dst_gc_timer_inc by dst_run_gc every time that function runs and cannot manage to empty the dst_garbage_list list. dst_gc_timer_inc must be in the range DST_GC_MIN to DST_GC_MAX as well.

Figure 33-10a. dst_free function

Figure 33-10b. dst_free function

The values of the three constants mentioned in the previous bullets, as defined in include/net/dst.h, are listed in Table 33-1.

Table 33-1. DST_GC_XXX constants

Name          Value
DST_GC_MIN    HZ/10
DST_GC_MAX    120*HZ
DST_GC_INC    HZ/2

Egress ICMP REDIRECT Rate Limiting

As discussed in Chapter 25, the kernel generates ICMP REDIRECT messages when it detects suboptimal routing. These ICMP messages are handled by the routing subsystem, which rate limits them as suggested by section 4.3.2.8 of RFC 1812.

The algorithm used is a simple exponential backoff algorithm. If the destination keeps ignoring ICMP REDIRECT messages, the kernel keeps sending them, up to ip_rt_redirect_number messages, doubling the interval between consecutive messages each time. After ip_rt_redirect_number such messages have been sent, the kernel stops sending them until ip_rt_redirect_silence seconds have passed without an input packet arriving that would trigger the generation of an ICMP REDIRECT. Once ip_rt_redirect_silence seconds have passed, the kernel starts sending ICMP REDIRECT messages again, if they are needed.

The initial delay for the exponential backoff algorithm is given by ip_rt_redirect_load. All three ip_rt_redirect_ xxx parameters are configurable via /proc. See Chapter 36 for the default values of those variables.

All the logic for egress REDIRECT messages is implemented in ip_rt_send_redirect, which is the routine called by the kernel when it detects the need for an ICMP REDIRECT (see Chapter 20).

Two dst_entry fields implement this feature:

rate_last

Timestamp when the last ICMP REDIRECT was sent.

rate_tokens

Number of ICMP REDIRECT messages already sent to the destination associated with this dst_entry instance. rate_tokens-1, therefore, represents the number of consecutive ICMP REDIRECT messages that the destination has ignored.
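
Putting the two fields together, the decision taken by ip_rt_send_redirect can be modeled roughly as below. The structure and function are simplified stand-ins, and the arithmetic is the generic exponential backoff described in this section, not a copy of the kernel source:

```c
#include <stdbool.h>

struct redirect_state {
    unsigned long rate_last;    /* when the last REDIRECT was sent */
    unsigned int  rate_tokens;  /* REDIRECTs sent in this burst    */
};

static bool may_send_redirect(struct redirect_state *st,
                              unsigned long now,
                              unsigned int  redirect_number,
                              unsigned long redirect_load,
                              unsigned long redirect_silence)
{
    /* After a long enough silence, start a fresh burst. */
    if (now - st->rate_last > redirect_silence)
        st->rate_tokens = 0;

    /* Give up after redirect_number consecutive messages. */
    if (st->rate_tokens >= redirect_number)
        return false;

    /* The required gap doubles with every message already sent. */
    if (now - st->rate_last < (redirect_load << st->rate_tokens))
        return false;

    st->rate_last = now;
    st->rate_tokens++;
    return true;
}
```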




[*] See the description of the flowi structure in the section "Main Data Structures" in Chapter 32.

[*] The TOS field, as shown in Figure 18-3 in Chapter 18, is an 8-bit field, of which bit 0 is ignored and bits 1 through 7 are used. However, the routing code uses only bits 1, 2, 3, and 4. It does not take the precedence component (bits 5, 6, and 7) into consideration for egress routes. Those bits are masked out with the macro RT_TOS.

[*] See the section "L2 Header Caching" in Chapter 27.

[*] This is not true when you remove a secondary address. See the section "Removing an IP address" in Chapter 32.

[*] The dst_entry->expires field is initialized to 0 in dst_alloc by a memset call that clears the whole structure.

[] Note that when dst_set_expires is called to expire an entry immediately, it replaces the input value of 0 with 1, to distinguish this situation from the 0 that means never to expire.

Chapter 34. Routing: Routing Tables

Given the central role of routing in the network stack and how big routing tables can be, it is important to have efficiently designed routing tables to speed up operations, particularly lookups. This chapter describes how Linux organizes routing tables, and how the data structures that compose a routing table are accessed with different hash tables, each one specialized for a different kind of lookup.

Organization of Routing Hash Tables

To support the key goal of returning information quickly for a wide variety of operations, Linux defines a number of different hash tables that point to the same data structures describing routes:

Organization of Per-Netmask Tables

At the highest level, routes are organized into different hash tables based on the lengths of their netmasks. Because IPv4 uses 32-bit addresses, 33 different netmask lengths (ranging from /0 to /32, where /0 represents default routes) can be associated with an IP address. The routing subsystem maintains a different hash table for each netmask length. These hash tables are then combined into other tables, described in subsequent sections in this chapter.

Figure 34-1 shows the relationships between the main data structures in a routing table. All of these data structures were briefly introduced in Chapter 32, and are described in detail in Chapter 36. In this chapter, we will concentrate on the relationships between the data structures.

Basic structures for hash table organization

Routing tables are described with fib_table data structures. The fib_table structure includes a vector of 33 pointers, one for each netmask, and each pointing to a data structure of type fn_zone. (The term zone refers to the networks that share a single netmask.) The fn_zone structures organize routes into hash tables, so routes that lead to destination networks with the same netmask length share the same hash table. Therefore, given any route, its associated hash table can be quickly identified by the route's netmask length. Nonempty fn_zone buckets are linked together, and the head of the list is saved in fn_zone_list. We will see in Chapter 35 how this list is used.

There is one exception to the general organization of these per-netmask hash tables. The table for the /0 zone, used for default routes, consists of a single bucket and therefore collapses into a simple list. This design choice was made because a host rarely maintains many default routes.

Routes are described by a combination of different data structures, each one representing a different piece of information. The information that defines a route is split into several data structures because it is possible for multiple routes to differ by only a few fields. Thus, by splitting routes in pieces instead of maintaining one large, flat structure, the routing subsystem makes it easier to share common pieces of information among similar routes, and therefore to isolate different functions and define cleaner interfaces among the functions.

For each unique subnet there is one instance of fib_node, identified by a variable named fn_key whose value is the subnet. For example, given the subnet 10.1.1.0/24, fn_key is 10.1.1. Note that the fib_node structure (and therefore its fn_key variable) is associated to a subnet, not to a single route; it's important to keep this in mind to avoid confusion later. The importance of this detail derives from the possibility of having different routes to the same subnet.

Different routes leading to the same subnet (i.e., the same fn_key) share the same fib_node. Each route is assigned its own fib_alias structure. You can have, for instance, different routes leading to the same subnet and differing only with regard to the TOS values: each fib_alias instance would therefore be assigned a different TOS value. Each fib_alias instance is associated with a fib_info structure, which stores the real routing information (i.e., how to get to the destination).
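
The split can be modeled with three small structures. The field sets are heavily abridged and the *_sk names are invented for this sketch; only the pointers that express the sharing described above are kept:

```c
#include <stdint.h>

struct fib_info_sk {                 /* how to reach the destination  */
    int      refcnt;                 /* shared by several fib_alias's */
    uint32_t gateway;                /* next-hop address              */
};

struct fib_alias_sk {                /* one route to the subnet       */
    struct fib_alias_sk *next;       /* list sorted by tos            */
    uint8_t              tos;
    struct fib_info_sk  *info;
};

struct fib_node_sk {                 /* one unique subnet (fn_key)    */
    uint32_t             key;        /* e.g. 10.1.1 for 10.1.1.0/24   */
    struct fib_alias_sk *aliases;
};
```

Two routes to the same subnet that differ only in TOS would share one fib_node_sk and, if they also use the same next hop, one fib_info_sk with refcnt 2.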

Given a fib_node instance, the associated list of fib_alias instances is sorted in increasing order of IP TOS (i.e., the fa_tos field). fib_alias instances with the same value of fa_tos are sorted in increasing order of the associated fib_info's fib_protocol field.
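
That ordering can be written as a comparator; the helper is purely illustrative (the real sorted insertion happens inside the fn_hash insertion code):

```c
/* Negative/zero/positive like strcmp: ascending fa_tos first,
 * ties broken by ascending fib_protocol. */
static int fib_alias_cmp(unsigned char tos_a, unsigned char proto_a,
                         unsigned char tos_b, unsigned char proto_b)
{
    if (tos_a != tos_b)
        return tos_a < tos_b ? -1 : 1;
    if (proto_a != proto_b)
        return proto_a < proto_b ? -1 : 1;
    return 0;
}
```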

Figure 34-1. Routing table organization

I explained earlier in this chapter that the routing subsystem is broken into multiple data structures to optimize their use and make the logic cleaner. Thus, the association between fib_alias and fib_info is not one-to-one; several fib_alias structures can share a fib_info structure. When different routes happen to share the same parameter values of an existing fib_info structure, they simply point to the same fib_info instance. Sharing is remembered through a reference count on the fib_info structure.

If, for instance, five routes to five different networks happen to use the same next-hop gateway, the information about the next hop would be the same for all of them and therefore it will make sense to share it. In this case, therefore, there are five fib_node structures and five fib_alias structures, but only one fib_info structure.

The sample configuration in Figure 34-1 shows a number of relationships among different structures making up the hash tables described in this section. In this figure:

  • There are four routes (i.e., four fib_alias instances).

  • These four routes lead to three different subnets (i.e., three fib_node instances) because two fib_alias instances share a common fib_node instance.

  • Two of the four routes share the same next-hop routers. Thus, the fa_info fields of these two fib_alias structures point to the same fib_info structure on the bottom-right side of the figure.

The data structure fields in the figure where a key appears on the right side are the fields used by the lookup routines you will see in Chapter 35.

Dynamic resizing of per-netmask hash tables

The size of the hash table fz_hash is increased when the number of elements passes a given threshold. A hash table can be resized repeatedly up to a given upper limit. The section "Adding a Route" will explain exactly how the insertion of a new element into the hash table triggers resizing.

Each of the 33 hash tables pointed to by fn_hash [*] is resized independently. A table is resized when the number of entries reaches twice the number of buckets, a value stored in fz_divisor, as shown in Figure 34-1. This heuristic is chosen mainly to limit the lookup time on the hash table. Keeping the number of elements below this threshold keeps lookups fast (assuming elements are well distributed).

The maximum size of a table is derived from architecture-specific parameters related to memory management. On an i386, the maximum size is 8 MB. Because each element of the table consists of a pointer, which has a size of 4 bytes on a 32-bit processor, an i386 system can support a hash table with more than 2 million buckets.

When a hash table is first created by fn_new_zone, the table is given a default size of 16 buckets. (The only exception, as mentioned in the previous section, is the /0 zone used for default routes.) The first two times the table is expanded, the size is increased to 256 and 1,024, respectively. Subsequent increases will always double the current size.
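
The growth schedule and the resize trigger can be sketched together; both helpers are invented for illustration (the kernel embeds this logic in the zone insertion path):

```c
/* 16 -> 256 -> 1024, then plain doubling. */
static unsigned int next_zone_size(unsigned int cur)
{
    if (cur == 16)
        return 256;
    if (cur == 256)
        return 1024;
    return cur * 2;
}

/* A zone is rehashed when entries reach twice the bucket count
 * (the value kept in fz_divisor). */
static int zone_needs_resize(unsigned int entries, unsigned int fz_divisor)
{
    return entries >= 2 * fz_divisor;
}
```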

There is currently no shrink mechanism. So, if a zone's hash table goes from 280 elements down to 10, the size of the table will not decrease from 256 to 16.

Organization of fib_info Structures

As shown in Figure 34-1, each fib_info structure includes two fields, fib_hash and fib_lhash, that are used to insert the structures into two more-comprehensive hash tables, shown in Figure 34-2. These hash tables are:

fib_info_hash

All fib_info structures are inserted into this hash table. Lookups on this table are done with fib_find_info.

fib_info_laddrhash

fib_info structures are inserted into this table only when the associated routes have a preferred source address. The use of the preferred source address is described in the section "Preferred Source Address Selection" in Chapter 35. That address is normally derived automatically from the device configuration, but it can also be explicitly configured.

This hash table is mainly used to facilitate the removal of routes affected by the deletion of a locally configured IP address (see fib_sync_down).

In both tables, new elements are added at the head of a bucket's list by fib_create_info.

Dynamic resizing of global hash tables

The total number of fib_info structures, in all routing tables, is stored in the counter fib_info_cnt. Its value is incremented by fib_create_info when an instance is created and decremented by free_fib_info when an instance is deleted.

When creating a new instance, fib_create_info checks whether fib_info_cnt has reached fib_hash_size, which is the size of the hash table, as shown in Figure 34-2. When this size is reached, both fib_info_hash and fib_info_laddrhash are doubled in size. The old hash tables are removed with fib_hash_free, the new ones are allocated with fib_hash_alloc, and all the fib_info instances are moved from the old tables to the new ones with fib_hash_move.
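
The grow-and-move step can be sketched in userspace as follows. The finfo type and fib_hash_grow are illustrative only; the real fib_hash_move also rebuilds fib_info_laddrhash in the same pass:

```c
#include <stdlib.h>

struct finfo {
    struct finfo *next;
    unsigned int  hashval;           /* precomputed hash of the entry */
};

/* Allocate a table twice as big and re-bucket every entry,
 * inserting at the head of each new bucket's list. */
static struct finfo **fib_hash_grow(struct finfo **old, unsigned int old_size)
{
    unsigned int new_size = old_size * 2, i;
    struct finfo **new_tbl = calloc(new_size, sizeof(*new_tbl));

    if (!new_tbl)
        return NULL;
    for (i = 0; i < old_size; i++) {
        while (old[i]) {
            struct finfo *e = old[i];
            unsigned int  b = e->hashval & (new_size - 1);

            old[i]     = e->next;
            e->next    = new_tbl[b];
            new_tbl[b] = e;
        }
    }
    free(old);
    return new_tbl;
}
```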

Figure 34-2. fib_info structures' organization

Note that the resizing discussed in this section has nothing to do with the one discussed in the section "Dynamic resizing of per-netmask hash tables."

Organization of Next-Hop Router Structures

As shown in Figure 34-1, each fib_info structure can include one or more fib_nh structures, each one representing a next-hop router. The information for a next-hop router includes the device through which it can be reached. Thus, it is easy to find a device when the router is known, but the structure does not provide a quick way to find a router when the device is known. The latter ability is important in two cases:

When a device is shut down

The networking subsystem has to disable all the routes associated with the device. This is done by fib_sync_down, described in Chapter 32.

When a device is enabled or re-enabled

The networking subsystem has to enable or re-enable all the routes associated with next-hop routers reachable via this device. This is done by fib_sync_up, also described in Chapter 32.

There is also another minor case pertaining to ICMP_REDIRECT messages. We saw in the section "Processing Ingress ICMP_REDIRECT Messages" in Chapter 31 that it is possible to have the kernel accept only ICMP redirects whose new suggested gateway is already known locally as a router. To check whether this condition is met, the kernel simply needs to browse all the routes associated with the device the ICMP was received from and look for a route that uses the new suggested gateway as its next-hop router. This logic is implemented by ip_fib_check_default, which is called by ip_rt_redirect. The latter is called by icmp_redirect, the handler for ingress ICMP redirect messages.

The requirements just described are solved by creating another hash table indexed by the device identifier; this makes lookups of next-hop routes extremely fast. The nh_hash field shown in Figure 34-1 is used to insert fib_nh structures in the fib_info_devhash hash table. That table is statically allocated in net/ipv4/fib_semantics with a size of DEVINDEX_HASHSIZE (256) buckets. New elements are inserted at the head of the table bucket's lists by fib_create_info.
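
A sketch of the bucket selection for such a device-indexed table; DEVINDEX_HASHSIZE matches the value quoted above, but the bit-folding mix is illustrative rather than the kernel's exact hash:

```c
#define DEVINDEX_HASHSIZE 256

static unsigned int fib_devindex_bucket(int ifindex)
{
    unsigned int h = (unsigned int)ifindex;

    /* Fold the higher bits down so large ifindexes still spread
     * across the 256 buckets. */
    h ^= h >> 8;
    h ^= h >> 16;
    return h & (DEVINDEX_HASHSIZE - 1);
}
```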

The Two Default Routing Tables: ip_fib_main_table and ip_fib_local_table

Two routing tables are always created at boot time regardless of the kernel configuration options:

ip_fib_local_table

The kernel installs routes to local addresses here, including the associated subnet's address and the subnet's broadcast addresses (see the section "Routes Inserted by the Kernel: The fib_magic Function" in Chapter 36). This routing table cannot be explicitly configured by the user.

ip_fib_main_table

All other routes go here (user-configured routes and routes generated by routing protocols).

The section "Special Routes" in Chapter 30 explains the relationship between these two routing tables. In Chapter 35, we will see how routing lookups use them.

Routing Table Initialization

Routing tables are initialized with fib_hash_init, defined in net/ipv4/fib_hash.c. It is called by ip_fib_init, which initializes the IP routing subsystem, to create the ip_fib_main_table and ip_fib_local_table tables (see the section "Routing Subsystem Initialization" in Chapter 32).

The first time fib_hash_init is called, it creates the memory pool fn_hash_kmem that will be used to allocate fib_node data structures.
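The "create the pool only on the first call" behavior is plain lazy one-time initialization. A minimal sketch of the pattern (the names are invented; the real kernel builds fn_hash_kmem with kmem_cache_create):

```c
#include <stddef.h>

/* Stand-in for the kernel's kmem_cache; here just a named token. */
struct mem_pool { const char *name; };

static struct mem_pool  fib_node_pool;
static struct mem_pool *fn_hash_kmem;   /* NULL until the first call */
static int              pool_creations; /* how many times the pool was built */

/* Mimics fib_hash_init's first-call behavior: build the pool once,
 * then reuse it on every later call. */
static struct mem_pool *get_fib_node_pool(void)
{
    if (fn_hash_kmem == NULL) {
        fib_node_pool.name = "ip_fib_hash";
        fn_hash_kmem = &fib_node_pool;
        pool_creations++;
    }
    return fn_hash_kmem;
}
```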

fib_hash_init first allocates a fib_table data structure and then initializes its virtual functions to the routines shown in Table 34-1. The function also clears the content of the bottom part of the structure (fn_hash), which, as shown in Figure 34-1, is used to distribute the routing entries on different hash tables based on their netmask lengths.

Table 34-1. Initialization of the fib_table's virtual functions

Method                  Routine used

tb_lookup               fn_hash_lookup
tb_insert               fn_hash_insert
tb_delete               fn_hash_delete
tb_flush                fn_hash_flush
tb_select_default       fn_hash_select_default
tb_dump                 fn_hash_dump

Adding and Removing Routes

In Chapter 36, we will see how routes are added, deleted, and modified by user commands and routing daemons. Both are satisfied through a single set of routines in the kernel's routing subsystem. In this section, we will see what the kernel has to do when asked to add or remove a route from one of its routing tables. As shown in Table 34-1, fn_hash_insert and fn_hash_delete are the routines used to insert and delete routes, and we will analyze them in the sections "Adding a Route" and "Deleting a Route." fn_hash_insert has several related uses, all involving changes of routes.

Here are a few operations common to the two routines:

  • Given a route to add or remove, derive the search key and use it to make a fib_node lookup and a fib_alias lookup. These lookups are similar to the ones done to route data packets, but are done for a different purpose: to check whether a route being added is a duplicate of an existing route, or whether a route being removed really exists.

  • Populate (in case of insert) and clean up (in case of delete) the right hash tables.

  • Flush the routing cache if necessary.

  • Generate a Netlink broadcast notification to tell the interested listeners that a route has been added to or removed from a routing table (see the section "Netlink Notifications" in Chapter 32).

Adding a Route

The insertion of a new route is taken care of by fn_hash_insert, whose logic is described in Figures 34-3(a) and 34-3(b).[*] This routine is actually called for many operations: in addition to the insertion of new routes, it handles appending, prepending, changing, and replacing. The different cases are distinguished by the NLM_F_XXX flags passed. The combination of flags associated with each operation is listed in Table 36-1 in Chapter 36.

The different requirements of different operations complicate the function's logic. For instance, as mentioned in the section "Organization of Routing Hash Tables," different routes with different TOS values can lead to the same destination. When the kernel adds a new route, it returns an error if there is already a route with the same destination and TOS. However, the same condition is actually a requirement when replacing a route. Therefore, based on the command type, the route lookup done by fn_hash_insert is expected to return a different result.
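The create-versus-replace asymmetry can be modeled in a few lines. This is a simplified model with invented names, not the kernel code; it only shows how one lookup result is an error for a create and a requirement for a replace:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for a chain of fib_alias entries sharing one fib_node. */
struct route_alias {
    uint32_t dst;   /* destination subnet */
    uint8_t  tos;   /* TOS value; 0 means "any" in this sketch */
    struct route_alias *next;
};

enum op { OP_CREATE, OP_REPLACE };

/* Look for an alias with the same destination and TOS. */
static struct route_alias *find_dup(struct route_alias *head,
                                    uint32_t dst, uint8_t tos)
{
    for (; head; head = head->next)
        if (head->dst == dst && head->tos == tos)
            return head;
    return NULL;
}

/* The same lookup result means "error" for a create and is required
 * for a replace, mirroring the behavior described for fn_hash_insert. */
static int check_insert(struct route_alias *head, uint32_t dst,
                        uint8_t tos, enum op op)
{
    struct route_alias *dup = find_dup(head, dst, tos);
    if (op == OP_CREATE)
        return dup ? -EEXIST : 0;
    /* OP_REPLACE: the route to replace must already exist */
    return dup ? 0 : -ESRCH;
}
```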

As explained in the section "Dynamic resizing of per-netmask hash tables," the insertion of a new route may trigger the resizing of a zone's hash table, which is taken care of by fn_rehash_zone. As explained in the section "Organization of Next-Hop Router Structures," new fib_info structures are added to the fib_info_laddrhash hash table when the associated routes specify a preferred source address. Each fib_nh structure representing one of the route's next hops is also added to the fib_info_devhash hash table.

When a replace operation replaces an existing route with a new one, the kernel flushes the routing cache so that the old route is no longer used.

Regardless of the type of operation, a Netlink notification is generated to notify all of the interested subsystems.

Deleting a Route

The deletion of a route is taken care of by fn_hash_delete, whose logic is described in Figure 34-4. Deleting a route is simpler than adding one; for example, there is only one type of operation.

First fn_hash_delete computes the search key and uses it for a lookup to see whether the entry to remove actually exists. When the victim fib_alias structure is found, the function deletes it, notifies interested listeners with a Netlink broadcast, and flushes the routing cache in case the route has been used (i.e., it has the FA_S_ACCESSED flag set).
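The order of operations just described — delete the alias, broadcast the change, and flush only if the route was ever used — can be sketched as follows (the structures and helpers are illustrative stand-ins, not the kernel's):

```c
#include <stdbool.h>

#define FA_S_ACCESSED 0x01

/* Stand-in for fib_alias: only the fields the sketch needs. */
struct alias {
    unsigned char state;    /* FA_S_ACCESSED is set once the route is used */
    bool          present;  /* still linked into its fib_node */
};

static bool cache_flushed;
static bool listeners_notified;

static void netlink_notify(void) { listeners_notified = true; }
static void flush_cache(void)    { cache_flushed = true; }

/* Mirrors fn_hash_delete's sequence: remove the entry, notify
 * listeners, and flush the cache only when the route was accessed. */
static int delete_alias(struct alias *fa)
{
    if (!fa || !fa->present)
        return -1;              /* nothing to delete */
    fa->present = false;
    netlink_notify();
    if (fa->state & FA_S_ACCESSED)
        flush_cache();
    return 0;
}
```

Skipping the flush for never-used routes avoids invalidating the whole routing cache for entries that cannot possibly have seeded it.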

The deletion of a fib_alias instance can lead to the deletion of a fib_info instance and a fib_node instance as well (use Figure 34-1 as a reference):

  • When the associated fib_node instance is left empty because the deleted fib_alias was its last instance, the fib_node gets deleted, too.

    Figure 34-3a. fn_hash_insert function

    Figure 34-3b. fn_hash_insert function

  • When the associated fib_info instance is left with a null fib_treeref reference count, it is freed because it is not needed anymore. In particular, fn_free_alias frees the matching fib_alias instance right away, and decrements the reference count fib_treeref on the associated fib_info instance with fib_release_info. When that reference count drops to zero, the fib_info instance is taken out of all of the hash tables it was inserted into, is marked dead by setting its fib_dead flag, and is freed with free_fib_info at the first invocation of fib_info_put. The next hops associated with the fib_info instance are also taken out of the hash table, as described in the section "Organization of Next-Hop Router Structures."

Manipulations of the fa_list and fn_alias lists are protected by the fib_hash_lock lock (see Figure 34-1).

Figure 34-4. fn_hash_delete function

Garbage Collection

Routes should be deleted when they are invalidated by configuration changes or changes of status for local devices. Several functions in the routing subsystem browse the routing tables (or a portion of them). Under certain conditions, one of them, fib_sync_down, marks the routes that are eligible for deletion with the RTNH_F_DEAD flag. Later, a call to fib_flush browses the routing tables again and removes those routes with the flag set. There is no periodic function that cleans up the routing tables in the way the routing cache is cleaned up.
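This two-pass mark-and-sweep can be sketched in miniature. The structures below are invented for the sketch; the real fib_sync_down and fib_flush walk the actual fib zone tables:

```c
#include <stddef.h>

#define RTNH_F_DEAD 0x01

/* Stand-in for a route entry in a bucket list. */
struct route {
    int           oif;    /* egress device index */
    unsigned int  flags;
    struct route *next;
};

/* First pass (fib_sync_down's role): mark routes through the dead
 * device. Returns how many routes were newly marked. */
static int sync_down(struct route *head, int dead_oif)
{
    int marked = 0;
    for (; head; head = head->next)
        if (head->oif == dead_oif && !(head->flags & RTNH_F_DEAD)) {
            head->flags |= RTNH_F_DEAD;
            marked++;
        }
    return marked;
}

/* Second pass (fib_flush's role): unlink every marked route.
 * No freeing here; the sketch only shows the list surgery. */
static struct route *flush_dead(struct route *head)
{
    struct route **pp = &head;
    while (*pp) {
        if ((*pp)->flags & RTNH_F_DEAD)
            *pp = (*pp)->next;   /* unlink the dead entry */
        else
            pp = &(*pp)->next;
    }
    return head;
}
```

Splitting mark and sweep lets several events accumulate RTNH_F_DEAD flags before a single flush pass pays the cost of rewriting the lists.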

The fib_sync_down routine is described in the section "Helper Routines" in Chapter 32.

Policy Routing and Its Effects on Routing Table Definitions

When the kernel is compiled with support for Policy Routing, an administrator can configure up to 255 independent routing tables. To support this optional feature, while keeping the routing subsystem lean and simple when Policy Routing is not used, the Linux developers have added some complexity to the source code that you should understand before trying to read the files.

Variable and Structure Definitions

With Policy Routing, the pointers to the 255 routing tables are stored in the fib_tables array, defined in net/ipv4/fib_frontend.c and shown in Figure 34-1.

struct fib_table *fib_tables[RT_TABLE_MAX+1];

Note that the two routing tables ip_fib_main_table and ip_fib_local_table are defined as two elements of fib_tables in include/net/ip_fib.h:

#ifndef CONFIG_IP_MULTIPLE_TABLES
extern struct fib_table *ip_fib_local_table;
extern struct fib_table *ip_fib_main_table;
... ... ...
#else
#define ip_fib_local_table (fib_tables[RT_TABLE_LOCAL])
#define ip_fib_main_table (fib_tables[RT_TABLE_MAIN])
... ... ...
#endif

When the first route is added to a new routing table, the table is initialized with fib_hash_init. When Policy Routing is not configured, this function is called only at boot time and therefore is tagged with the __init macro.[*] But with Policy Routing, a new routing table can be created at any time, so fib_hash_init cannot be so tagged. This explains its conditional prototype definition:

#ifdef CONFIG_IP_MULTIPLE_TABLES
struct fib_table * fib_hash_init(int id)
#else
struct fib_table * __init fib_hash_init(int id)
#endif
{
    ... ... ...
}

Even with Policy Routing support, all configured routes are added to ip_fib_main_table unless an ID for a different routing table is explicitly specified. The table ID can be provided only with the new ip command, not with the traditional route command.

Double Definitions for Functions

The Policy Routing feature is not transparently integrated into the routing code. For example, the variables, routines, or pieces of code that are needed only when there is Policy Routing support in the kernel are protected by the preprocessor conditional variable CONFIG_IP_MULTIPLE_TABLES.[*]

There are also a few global variables and functions that have a double definition, one to use when there is no policy routing support in the kernel and another one to use when there is support. Two important ones, defined in net/ipv4/fib_rules.c and in include/net/ip_fib.h, are:

fib_lookup

Used to make routing table lookups, and described in Chapter 35

fib_select_default

Used to select a default route when forwarding a packet when there is no explicit route to its destination

Besides these two functions, there are double definitions for a few others, such as fib_get_table (which returns the routing table given the table ID) and fib_new_table (which creates a new routing table).

It is important to be aware of the double definitions of these routines when browsing the source code, particularly with tools such as TAGS and cscope. Otherwise, you may be looking at the wrong instance while analyzing a given code path.




[*] Do not confuse fn_hash with fz_hash.

[*] The flowchart does not follow the source code flow precisely, but preserves the logic.

[*] Chapter 7 describes the use and meaning of the __init macro.

[*] Do not mistake multiple tables with multipath; they are two totally different and independent features.

Chapter 35. Routing: Lookups

In Chapter 33, we saw how lookups are triggered by both ingress and egress traffic. The cache is always searched first, and when there is a miss, the routing tables are consulted through the ip_route_input_slow and ip_route_output_slow functions. In this chapter, we will analyze these functions; in particular, we will cover:

  • How ingress and egress routing differ

  • How a routing table is searched with fib_lookup

  • How policy routing lookups differ from normal lookups

  • When and how multipath routing is handled

  • How the selection of a default gateway works

High-Level View of Lookup Functions

Regardless of the direction of the traffic, a routing table lookup is made with fib_lookup. However, as mentioned in the section "Double Definitions for Functions" in Chapter 34, there are two versions of fib_lookup, one used when the kernel has support for Policy Routing (net/ipv4/fib_rules.c) and one when that support is not included (include/net/ip_fib.h). The selection of the right routine is made at compile time, so when ip_route_input_slow and ip_route_output_slow call fib_lookup, they transparently invoke the right one.

Let's briefly see the key functions used to make a route lookup. You will find it helpful to refer to Figure 34-1 in Chapter 34 during this discussion.

The fib_lookup routine is a wrapper around the lookup function provided by each routing table. The version provided when there is no policy routing simply runs the lookup function for the local and main tables, and the other has more complicated logic that allows it to consult the tables provided by policy routing.

As shown in Figure 35-1, the lookup function invoked from fib_lookup is fn_hash_lookup, which is the routine to which the fib_table's function pointer tb_lookup is initialized (see the section "Routing Table Initialization" in Chapter 34). This function identifies the fib_node instance whose key matches the destination address. Then fn_hash_lookup asks fib_semantic_match to do a lookup on the fib_alias instances associated with the matching fib_node. If one is identified, fib_semantic_match may also need to select the right next hop when Multipath is configured.

Figure 35-1. Relationships among the main lookup routines

All the functions introduced here are described in detail in later sections. In particular, they cover:

Helper Routines

Here are a few routines used by some of the functions we will cover in this chapter:

fib_validate_source

Validates the source IP address of a packet received on a given device, to detect spoofing attempts. Among other things, this function makes sure that unless asymmetric routing is enabled, the source IP address of the packet is reachable through the same interface the packet was received from (see the section "Reverse Path Filtering" in Chapter 31). It also returns the preferred source address spec_dst to use for the reverse direction, as described in the upcoming section "Preferred Source Address Selection," and initializes the routing tag, as described in the section "Routing Table Based Classifier" in Chapter 31.

inet_select_addr

Given a device dev, an IP address dst, and a scope scope, returns the first primary address with scope scope, to use in sending a packet to the address dst out of device dev.

This routine is needed because a device can be configured with multiple addresses, and each can have its own scope.

The reason for the dst argument is that if dev is configured with different IP addresses on different subnets, dst allows this function to return an IP address configured on the same subnet as dst.

In the section "Scope" in Chapter 30, we saw that there are primary and secondary addresses; inet_select_addr returns only primary addresses.

If no address configured on dev meets the conditions specified by scope and dst, the function tries the rest of the devices, checking if any have an address configured with the required scope. Because the loopback_dev device is the first one inserted into the dev_base list, it will be the first one to be tried.
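A simplified model of this selection logic, ignoring secondary addresses and the cross-device fallback (the structures and scope encoding are assumptions of the sketch, not the kernel's):

```c
#include <stddef.h>
#include <stdint.h>

/* A primary address configured on a device, with its netmask and scope.
 * As in the kernel, a numerically larger scope value is narrower. */
struct ifaddr {
    uint32_t addr;
    uint32_t mask;
    int      scope;
    struct ifaddr *next;
};

/* Simplified inet_select_addr: prefer a primary address on the same
 * subnet as dst; otherwise fall back to the first address whose scope
 * is acceptable. Returns 0 when nothing qualifies. */
static uint32_t select_addr(const struct ifaddr *list,
                            uint32_t dst, int scope)
{
    uint32_t fallback = 0;
    for (; list; list = list->next) {
        if (list->scope > scope)
            continue;            /* narrower scope than requested: skip */
        if ((list->addr & list->mask) == (dst & list->mask))
            return list->addr;   /* same subnet as dst: best choice */
        if (!fallback)
            fallback = list->addr;
    }
    return fallback;
}
```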

rt_set_nexthop

Given a routing cache entry rtable and a routing table lookup result res, completes the initialization of rtable's fields, such as rt_gateway and the metrics vector of the embedded dst_entry structure. This function also initializes the routing tag described in the section "Routing Table Based Classifier" in Chapter 31.

The Table Lookup: fn_hash_lookup

All routing table lookups , regardless of the tables provided by Policy Routing and the direction of the traffic, are done with fn_hash_lookup. This function is registered as the handler for the tb_lookup function pointer of the fib_table structure in fib_hash_init (see the section "Routing Table Initialization" in Chapter 34).

The function's lookup algorithm uses the LPM algorithm introduced in Chapter 30. The execution of this algorithm is facilitated by the organization of routes into per-netmask hash tables, as shown in Figure 34-1 in Chapter 34. fn_hash_lookup searches for the fib_node instance that has the information to route packets to a particular destination.

The prototype for fn_hash_lookup is:

static int
fn_hash_lookup(struct fib_table *tb, const struct flowi *flp, struct fib_result *res)

Here is the meaning of its input parameters:

tb

The routing table to search. Because fn_hash_lookup is a generic lookup routine that runs on one table at a time, the tables to search are decided by the caller, depending on Policy Routing support and related factors.

flp

Search key.

res

Upon success, res is initialized with the routing information.

And these are the possible return values:

0: success

res has been initialized (by fib_semantic_match) with the forwarding information.

1: failure

No route matched the search key.

Less than 0: Administrative failure

This means the lookup cannot succeed because the route found is of no value: for instance, the associated host may be flagged as unreachable.

The LPM algorithm loops over the routes, starting with the zone that represents the longest netmask. This is because longer netmasks mean more specific routes, which in turn means that the packet is likely to get closer to the final destination. (For instance, a /27 netmask that can cover only 30 hosts is preferred over a /24 netmask that potentially covers 254.) Thus, the search browses all the active zones, starting from the ones with the longest netmasks. As we saw in the section "Organization of Routing Hash Tables" in Chapter 34, all the active zones are sorted by netmask length and fn_zone_list stores the head of the list.

    struct fn_hash *t = (struct fn_hash*)tb->tb_data;
    read_lock(&fib_hash_lock);
    for (fz = t->fn_zone_list; fz; fz = fz->fz_next) {
        struct hlist_head *head;
        struct hlist_node *node;
        struct fib_node *f;

The function ANDs the destination IP address with the netmask of the active zone being checked, and uses the result as a search key. For example, if the function is currently checking the /24 zone, and the destination address flp->fl4_dst is 10.0.1.2, the search key k is initialized to 10.0.1.2 & 255.255.255.0, which comes out to 10.0.1.0. This means that the following piece of code searches for a route to the subnet 10.0.1.0/24:

        u32 k = fz_key(flp->fl4_dst, fz);
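The key computation itself is a plain mask operation. A minimal sketch with the worked example from the text (kept in host byte order for readability; the kernel works on network-byte-order keys, and these helper names are invented):

```c
#include <stdint.h>

/* Build a /n netmask in host byte order (n = 0..32). */
static uint32_t prefix_mask(int n)
{
    return n == 0 ? 0 : 0xFFFFFFFFu << (32 - n);
}

/* fz_key reduced to its essence: the destination ANDed with the
 * netmask of the zone being searched. */
static uint32_t zone_key(uint32_t dst, int prefixlen)
{
    return dst & prefix_mask(prefixlen);
}
```

With dst = 10.0.1.2 (0x0A000102) and a /24 zone, the key comes out to 10.0.1.0 (0x0A000100), exactly as described above.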

Because routes are stored in a hash table (fz_hash), head selects the right bucket of the table by applying a hash function to the key k. The next step is to browse the list of routes (fib_node structures) associated with the selected table's bucket and look for one that matches k.

        head = &fz->fz_hash[fn_hash(k, fz)];
        hlist_for_each_entry(f, node, head, fn_hash) {
            if (f->fn_key != k)
                continue;
 
            err = fib_semantic_match(&f->fn_alias,
                                            flp, res,
                                            f->fn_key, fz->fz_mask,
                                            fz->fz_order);
 
            if (err <= 0)
                goto out;
        }
    }
    err = 1;
out:
    read_unlock(&fib_hash_lock);
    return err;
}

We saw in the section "Organization of Routing Hash Tables" in Chapter 34 that a fib_node covers all the routes that lead to the same subnet but that could differ on other fields such as TOS. Now, if fn_hash_lookup manages to find a fib_node that matches the search key k, the function still needs to check each potential route to find one that also matches the other search key fields received in input through the flp parameter. This detailed check is taken care of by fib_semantic_match, described in the next section.

If fib_semantic_match returns success, it also initializes the input parameter res that stores the result of the lookup, and fn_hash_lookup returns this result to its caller. fn_hash_lookup loops through all the zones until fib_semantic_match either returns a successful result or discovers that the table's routes are unusable (i.e., they do not match).

Semantic Matching on Subsidiary Criteria

fib_semantic_match is called to find whether any routes (fib_alias structures) among the ones associated with a given fib_node match all the required search key fields. We saw in the previous section that the main field, the final destination IP address to which the packet must be routed, was matched by fn_hash_lookup before invoking this function. So it falls to fib_semantic_match to check the other criteria.

Once fib_semantic_match has identified the right instance of fib_alias, it simply needs to extract the routing information from the associated fib_node. The only additional task required is the selection of the next hop. This last task is needed only when the matching route uses Multipath, and it can be handled in two ways:

  • By fib_semantic_match, when the search key provides an egress device.

  • By fib_select_multipath, when the search key does not provide an egress device. fib_select_multipath is called by the ip_route_input_slow or ip_route_output_slow routine.

The logic of fib_semantic_match is shown in Figure 35-2.

Figure 35-2. fib_semantic_match function

Criteria for rejecting routes

While browsing fib_alias structures, fib_semantic_match rejects the ones that:

  • Do not match the TOS. Note that when routes are not configured with a TOS value, they can be used to route packets with any TOS.

  • Have a narrower scope than the one specified with the search key. For example, if the routing subsystem is looking for a route with scope RT_SCOPE_UNIVERSE, it cannot use one with scope RT_SCOPE_LINK.

Furthermore, the function must check whether a route or the desired next hop has gone away, in which case the routing subsystem has marked it for deletion by setting its RTNH_F_DEAD flag. The section "Helper Routines" in Chapter 32 shows how the RTNH_F_DEAD flag can be set for an entire route or for a single next hop of a route.

Once an eligible fib_alias instance has been identified, and supposing the associated fib_info structure is usable (i.e., not marked RTNH_F_DEAD), fib_semantic_match needs to browse all the next hops' fib_nh instances to find one that also matches the search key's device, if a device was specified. It is possible that none of the next hops can actually be used. This could happen for one of two main reasons:

  • All the next hops are unusable (that is, they have their RTNH_F_DEAD flags set).

  • The search key specifies an egress device that does not match any of the next hop configurations.

When there is no support for Multipath, there can be only one next hop.
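The next-hop selection criteria just described — skip dead hops, and honor the search key's egress device when one was given — can be sketched as follows (invented structures; the real check walks fib_nh entries inside fib_semantic_match):

```c
#include <stddef.h>

#define RTNH_F_DEAD 0x01

/* Stand-in for fib_nh: one next hop of a (possibly Multipath) route. */
struct nexthop {
    int           oif;    /* egress device of this next hop */
    unsigned int  flags;
    struct nexthop *next;
};

/* Pick the first usable next hop. A zero oif in the search key means
 * "no device constraint", as elsewhere in the routing code. */
static struct nexthop *select_nh(struct nexthop *list, int oif)
{
    for (; list; list = list->next) {
        if (list->flags & RTNH_F_DEAD)
            continue;            /* unusable next hop */
        if (oif && list->oif != oif)
            continue;            /* does not match the requested device */
        return list;
    }
    return NULL;   /* no usable next hop: the route cannot be used */
}
```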

While browsing fib_alias instances, fib_semantic_match sets the FA_S_ACCESSED flag on those that meet the scope and TOS requirements mentioned earlier in this section. The flag is set regardless of whether the fib_alias is selected. If and when the fib_alias instance is removed, this flag will be taken into account to decide whether the cache should be flushed.

Return value from fib_semantic_match

As stated earlier, the return value from fib_semantic_match can take one of three meanings:

  • 1 means there is no matching route.

  • 0 means success. In this case, the result of the lookup is stored in the input parameter res. The result includes a pointer to the matching fib_info instance.

  • A negative value represents an administrative failure.

Both 0 and the negative return values are determined from the type (fa->fa_type) of the matching route (fa) found by fib_semantic_match. Examples of the type value are RTN_UNICAST and RTN_LOCAL. From this type, fib_semantic_match can decide whether the lookup should succeed or fail, and can pass back an error code that allows the kernel to take the proper action in case of failure.

For example, a route of type RTN_UNREACHABLE causes fib_semantic_match to return the error -EHOSTUNREACH, which then leads the kernel to generate an ICMP unreachable message. A route of type RTN_THROW causes fib_semantic_match to return the error -EAGAIN, which instructs the Policy Routing version of fib_lookup in net/ipv4/fib_rules.c to retry the lookup with the next routing table.

Because the fa->fa_type type field drives the value returned, the error codes are embodied in a fib_props array, defined and initialized in the file net/ipv4/fib_semantics.c (see the section "rtable Structure" in Chapter 36). The array contains an element for each possible route type that specifies the associated error code and an RT_SCOPE_XXX scope. Deriving the error code and scope is as simple as referencing the element of fib_props corresponding to the index fa->fa_type.

Table 35-1 shows how fib_props is initialized.

Table 35-1. Initialization of fib_props

Route type        Error            Scope
RTN_UNSPEC        0                RT_SCOPE_NOWHERE
RTN_UNICAST       0                RT_SCOPE_UNIVERSE
RTN_LOCAL         0                RT_SCOPE_HOST
RTN_BROADCAST     0                RT_SCOPE_LINK
RTN_ANYCAST       0                RT_SCOPE_LINK
RTN_MULTICAST     0                RT_SCOPE_UNIVERSE
RTN_BLACKHOLE     -EINVAL          RT_SCOPE_UNIVERSE
RTN_UNREACHABLE   -EHOSTUNREACH    RT_SCOPE_UNIVERSE
RTN_PROHIBIT      -EACCES          RT_SCOPE_UNIVERSE
RTN_THROW         -EAGAIN          RT_SCOPE_UNIVERSE
RTN_NAT           -EAGAIN          RT_SCOPE_NOWHERE
RTN_XRESOLVE      -EINVAL          RT_SCOPE_NOWHERE

Note that the first few elements have a value of 0 for error: in these cases, fib_semantic_match returns success. The others have an error code used by the routing code to handle the routing failure correctly.

fib_lookup Function

As mentioned in the section "Special Routes" in Chapter 30, the kernel uses two routing tables by default when there is no support for Policy Routing. A routing table lookup simply consists of two table lookups (two calls to fn_hash_lookup), so the fib_lookup function defined in include/net/ip_fib.h is quite brief:

static inline int fib_lookup(const struct flowi *flp, struct fib_result *res)
{
        if (ip_fib_local_table->tb_lookup(ip_fib_local_table, flp, res) &&
            ip_fib_main_table->tb_lookup(ip_fib_main_table, flp, res))
                return -ENETUNREACH;
        return 0;
}

The search key is flp. The function first checks the ip_fib_local_table routing table, and if that fails, it checks the ip_fib_main_table routing table. If neither table manages to find a match, fib_lookup returns -ENETUNREACH (unreachable destination network).

Setting Functions for Reception and Transmission

Both received packets and locally generated packets need to be routed: in the case of received packets, to find out whether they should be locally delivered or forwarded, and in the case of locally generated packets, to find out whether they should be delivered locally or transmitted out.

In both cases, given a packet to route skb, the result of the routing lookup is saved in skb->dst. This is a data structure of type dst_entry, described in detail in the section "dst_entry Structure" in Chapter 36. This data structure includes several fields; two of them are function pointers named input and output that process the packet in accordance with the result of the routing lookup. The next section goes into detail on the initializations of these function pointers.

Then the sections "Input Routing" and "Output Routing" describe in detail the routines ip_route_input_slow and ip_route_output_slow, used respectively to find routes for ingress and egress packets when the cache lookup fails. These two functions can be a bit scary because of their size and the extent to which they apply macros, conditional code (such as #ifdef), and special feature handling (e.g., Multipath), but they actually are simpler than they look. In addition, programmers who do not like the use of goto statements in the source code may be disappointed by their heavy use. I have included a flowchart for each function, which you might want to look at for high-level descriptions before reading about the functions.

In the rest of this section, we will see how the two virtual functions, dst->input and dst->output, are initialized, and learn more about when they are invoked. The functions to which they are set depend on a few factors, including:

  • Whether the packet is being transmitted, received, or forwarded

  • Whether the address is unicast or multicast

  • Whether an error is detected during the routing lookup

Tables 35-2 and 35-3 list the routines to which dst->input and dst->output can be initialized.

Table 35-2. Routines used for dst->input

Function           Description
ip_local_deliver   Deliver the packet locally. See Chapter 20.
ip_forward         Forward a unicast packet. See Chapter 20.
ip_mr_input        Forward a multicast packet.
ip_error           Handle an unreachable destination. See the section "Routing Failure."
dst_discard_in     Simply drop any input packet.

Table 35-3. Routines used for dst->output

Function           Description
ip_output          Wrapper around ip_finish_output. See Chapter 21.
ip_mc_output       Handle egress packets with a multicast destination address.
ip_rt_bug          Print a warning message, because it is not supposed to be called.
dst_discard_out    Simply drop any packet.

Not all combinations of the functions in Tables 35-2 and 35-3 are possible; Figure 35-3 summarizes the meaningful ones. These combinations do not include the dst_discard_ xxx routines because they are found only in special cases independent from routing lookups (see the section "Special Cases").

Figure 35-3 shows how dst->input and dst->output are initialized for ingress and egress traffic. Let's see one case at a time.

Figure 35-3. dst->input and dst->output initialization

Initialization of Function Pointers for Ingress Traffic

We saw in Chapter 19 that ingress IP traffic is processed by ip_rcv_finish. This function consults the routing table to decide whether the packet is to be delivered locally or dropped. This decision is taken by ip_route_input, which first checks the cache and then the routing tables (ip_route_input_slow) in case of a cache miss. ip_route_input_slow can create three main combinations of dst->input and dst->output:

  • If the packet is to be forwarded, the function initializes dst->input to ip_forward and dst->output to ip_output. dst_input will therefore call ip_forward, which indirectly ends up calling dst_output and therefore ip_output. This is case (1) in Figure 35-3.

  • If the packet is to be delivered locally, the function initializes dst->input to ip_local_deliver. There is no need to initialize dst->output, but it's initialized anyway to ip_rt_bug, a routine that prints an error message when called. This can help detect bugs where dst->output is wrongly called when dealing with packets being delivered locally.

  • If the destination address is not reachable according to the routing table, dst->input is initialized to ip_error, which generates an ICMP message whose type depends on the exact result returned by the routing lookup. Since ip_error frees the skb buffer, there is no need to initialize dst->output because it would not be called even by mistake.

Initialization of Function Pointers for Egress Traffic

We saw in Chapter 21 that there are several different transmission routines on the IP layer. Figure 35-3 uses ip_queue_xmit as an example, but regardless of the routine invoked, it ultimately results in a routing lookup with __ip_route_output_key, which in case of a cache miss relies on ip_route_output_slow. The latter function can create four main combinations of dst->input and dst->output:

  • If the destination is a remote host, the function initializes dst->output to ip_output. Here there is no need to initialize dst->input. However, it would have made sense to use a fake initialization to something like ip_rt_bug to catch bugs, as we saw in the section "Initialization of Function Pointers for Ingress Traffic."

  • If the destination is the local system, the function initializes dst->output to ip_output and dst->input to ip_local_deliver. This is an interesting combination that goes in something of a circle. When dst_output calls ip_output, the latter transmits the packet out the loopback device, leading to the execution of ip_rcv and ip_rcv_finish. ip_rcv_finish sees that the ingress buffer already has routing information in skb->dst, and therefore calls dst_input, which in turn invokes ip_local_deliver. This is case (2) in Figure 35-3.

  • If the destination address is a locally configured multicast IP address, the function initializes dst->output to ip_mc_output. Multicast code then takes care of the packet. dst->input is not initialized.

  • The same multicast case is handled slightly differently when the kernel is compiled with support for multicast routing. In this case, dst->output is still initialized to ip_mc_output, but dst->input is initialized as well, to the routine ip_mr_input.

Special Cases

When a cached route dst is not supposed to be used, dst->output is initialized to dst_discard_out and dst->input is initialized to dst_discard_in. Both routines simply drop any packet they are passed. One example of their use is when a cached route is to be removed but cannot be destroyed because there are still references left to it (see the section "Deleting DST Entries" in Chapter 33).

These two routines are also used when a new entry is allocated and is not ready to be used because it is not fully initialized yet (see dst_alloc).

General Structure of the Input and Output Routing Routines

We saw in the section "Cache Lookup" in Chapter 33 that ingress and egress routing lookups that cannot be satisfied by the cache are taken care of by ip_route_input_slow and ip_route_output_slow, respectively.

Both routines are pretty long. To make them more readable, part of their code has been moved to two inline[*] functions, called ip_mkroute_input and ip_mkroute_output, respectively. Both routines differentiate between the case where the kernel supports multipath caching and the case where it does not. In the latter case, they become aliases to the two routines ip_mkroute_input_def and ip_mkroute_output_def, respectively. Regardless of whether multipath caching is supported, the routing cache entry is allocated and initialized with __mkroute_input and __mkroute_output. Regardless of whether it is ip_route_input_slow or ip_route_output_slow that triggers the insertion of a new entry into the cache, that operation is performed by rt_intern_hash.

Figure 35-4 summarizes the material in this section and shows how symmetrical the skeletons of the two slow routines are.

Figure 35-4. Skeleton of ip_route_input_slow and ip_route_output_slow

The differences with regard to Multipath, like the call to fib_select_multipath in ip_mkroute_input_def that is missing in ip_mkroute_output_def, will be explained in the section "Multipath Caching."

Input Routing

Ingress IP packets for which no route can be found in the cache by ip_route_input are checked against the routing tables by ip_route_input_slow, which is defined in net/ipv4/route.c and whose logic is shown in Figures 35-5(a) and 35-5(b). In this section, we describe the internals of this routine in detail.

Figure 35-5a. ip_route_input_slow function

Figure 35-5b. ip_route_input_slow function

The function starts with a few sanity checks on the source and destination addresses; for instance, the source IP address must not be a multicast address. I already listed most of those checks in the section "Verbose Monitoring" in Chapter 31. More sanity checks are done later in the function.

The routing table lookup is done with fib_lookup, the routine introduced in the section "fib_lookup Function." If fib_lookup cannot find a matching route, the packet is dropped; additionally, if the receiving interface is configured with forwarding enabled, an ICMP_UNREACHABLE message is sent back to the source. Note that the ICMP message is sent not by ip_route_input_slow but by its caller, who takes care of it upon seeing a return value of RTN_UNREACHABLE.

In case of success, ip_route_input_slow distinguishes the following three cases:

  • Packet addressed to a broadcast address

  • Packet addressed to a local address

  • Packet addressed to a remote address

In the first two cases, the packet is to be delivered locally, and in the third, it needs to be forwarded. The details of how local delivery and forwarding are handled can be found in the sections "Local Delivery" and "Forwarding." Here are some of the tasks they both need to take care of:

Sanity checks, especially on the source address

Source addresses are checked against illegal values and are run through fib_validate_source to detect spoofing attempts.

Creation and initialization of a new cache entry (the local variable rth)

See the following section, "Creation of a Cache Entry."

Creation of a Cache Entry

I said already in the section "Cache Lookup" in Chapter 33 that ip_route_input (and therefore ip_route_input_slow, in case of a cache miss) can be called just to consult the routing table, not necessarily to route an ingress packet. Because of that, ip_route_input_slow does not always create a new cache entry. When invoked from IP or an L4 protocol (such as IP over IP), the function always creates a cache entry. Currently, the only other possibility is invocation by ARP. Routes generated by ARP are cached only when they would be valid for proxy ARP. See the section "Processing ARPOP_REQUEST Packets" in Chapter 28.

The new entry is allocated with dst_alloc. Of particular importance are the following initializations for the new cache entry:

rth->u.dst.input

rth->u.dst.output

These two virtual functions are invoked respectively by dst_input and dst_output to complete the processing of ingress and egress packets, as shown in Figure 18-1 in Chapter 18. We already saw in the section "Setting Functions for Reception and Transmission" how these two routines can be initialized depending on whether a packet is to be forwarded, delivered locally, or dropped.

rth->fl

This flowi structure is used as a search key by cache lookups. It is important to note that rth->fl's fields are initialized to the input parameters received by ip_route_input_slow: this ensures that the next time a lookup is done with the same parameters, ip_route_input will be able to satisfy it with a cache lookup.

rth->rt_spec_dst

This is the preferred source address. See the following section, "Preferred Source Address Selection."

Preferred Source Address Selection

The route added to the routing cache is unidirectional, meaning that it will not be used to route traffic in the reverse direction toward the source IP address of the packet being routed. However, in some cases, the reception of a packet can trigger an action that requires the local host to choose a source IP address that it can use when transmitting a packet back to the sender.[*] This address, the preferred source IP address,[] must be saved with the routing cache entry that routed the ingress packet. Here are two cases where that address, which is saved in a field called rt_spec_dst, comes in handy:

ICMP

When a host receives an ICMP ECHO REQUEST message (popularly known as "pings" from the name of the command that usually generates them), the host returns an ICMP ECHO REPLY unless it is explicitly configured not to. The rt_spec_dst of the route used for the ingress ICMP ECHO REQUEST is used as the source address for the routing lookup made to route the ICMP ECHO REPLY. See icmp_reply in net/ipv4/icmp.c, and see Chapter 25. The ip_send_reply routine in net/ipv4/ip_output.c does something similar.

IP options

A couple of IP options require the intermediate hosts between the source and the destination to write the IP addresses of their receiving interfaces into the IP header. The address that Linux writes is rt_spec_dst. See the description of ip_options_compile in Chapter 19.

The preferred source is selected through the fib_validate_source function mentioned in the section "Helper Routines" and called by ip_route_input_slow.

ip_route_input_slow initializes the preferred source IP address rt_spec_dst based on the destination address of the packet being routed:

Packet addressed to a local address

In this case, the local address to which the packet was addressed becomes the preferred source address. (The ICMP example previously cited falls into this case.)

Broadcast packet

A broadcast address cannot be used as a source address for egress packets, so in this case, ip_route_input_slow does more investigation with the help of two other routines: inet_select_addr and fib_validate_source (see the section "Helper Routines").

When the source IP address is not set in the received packet (that is, when it is all zeroes), inet_select_addr selects the first address with scope RT_SCOPE_LINK on the device the packet was received from. This is because packets are sent with a null source address when addressed to the limited broadcast address, which is an address with scope RT_SCOPE_LINK. An example is a DHCP discovery message.

When the source address is not all zeroes, fib_validate_source takes care of it.

Forwarded packet

In this case, the choice is left to fib_validate_source. (The IP options example previously cited falls into this case.)

The preferred source IP to use for packets matching a given route can be explicitly configured by the user with a command like this:

ip route add 10.0.1.0/24 via 10.0.0.1 src 10.0.3.100

In this example, when transmitting packets to the hosts of the 10.0.1.0/24 subnet, the kernel will use 10.0.3.100 as the source IP address. Of course, only locally configured addresses are accepted: this means that for the previous command to be accepted, 10.0.3.100 must have been configured on one of the local interfaces, but not necessarily on the same device used to reach the 10.0.1.0/24 subnet. (Remember that in Linux, addresses belong to the host, not to the devices; see the section "Responding from Multiple Interfaces" in Chapter 28.) An administrator normally provides a source address when she does not want to use the one that would be picked by default from the egress device.

Figure 35-6 summarizes how rt_spec_dst is selected.

Local Delivery

The following types of packets are delivered locally by initializing dst->input appropriately, as we saw in the section "Initialization of Function Pointers for Ingress Traffic":

  • Packets addressed to locally configured addresses, including multicast addresses

  • Packets addressed to broadcast addresses

Figure 35-6. Selection of rt_spec_dst

ip_route_input_slow recognizes two kinds of broadcasts:

Limited broadcasts

This is an address consisting of all ones: 255.255.255.255.[*] It can be recognized easily without a call to fib_lookup. Limited broadcasts are delivered to any host on the link, regardless of the subnet the host is configured on. No table lookup is required.

Subnet broadcasts

These broadcasts are directed at hosts configured on a specific subnet. If hosts are configured on different subnets reachable via the same device (see Figure 30-4(c) in Chapter 30), only the right ones will receive a subnet broadcast. Unlike a limited broadcast, subnet broadcasts cannot be recognized without involving the routing table with fib_lookup. For example, the address 10.0.1.127 might be a subnet broadcast in 10.0.1.0/25, but not in 10.0.1.0/24.

ip_route_input_slow accepts broadcasts only if they are generated by the IP protocol. You might think that this is a superfluous check, given that ip_route_input_slow is called to route IP packets. However, as I said in the section "Cache Lookup" in Chapter 33, the input buffer to ip_route_input (and therefore to ip_route_input_slow in case of a cache miss) does not necessarily represent a packet to be routed.

If everything goes fine, a new cache entry, rtable, is created, initialized, and inserted into the routing cache.

Note that there is no need to handle Multipath for packets that are delivered locally.

Forwarding

If the packet is to be forwarded but the configuration of the ingress device has disabled forwarding, the packet cannot be transmitted and must be dropped. The forwarding status of the device is checked with IN_DEV_FORWARD. Figure 35-7 shows the internals of ip_mkroute_input; in particular, it shows what that function looks like when there is no support for multipath caching (i.e., when ip_mkroute_input ends up being an alias to ip_mkroute_input_def). In the section "Multipath Caching," you will see how the other case differs.

If the matching route returned by fib_lookup includes more than one next hop, fib_select_multipath is used to choose among them. When multipath caching is supported, the selection is taken care of differently. The section "Effects of Multipath on Next Hop Selection" describes the algorithm used for the selection.

The source address is validated with fib_validate_source. Then, based on the factors we saw in the section "Transmitting ICMP_REDIRECT Messages" in Chapter 31, the kernel may decide to send an ICMP_REDIRECT to the source. In that case, the ICMP message is sent not by ip_route_input_slow directly, but by ip_forward, which takes care of it upon seeing the RTCF_DOREDIRECT flag.

As we saw in the section "Creation of a Cache Entry," the result of a routing lookup is not always cached.

Routing Failure

When a packet cannot be routed, either because of host configuration or because no route matches, the new route is added to the cache with dst->input initialized to ip_error. This means that all the ingress packets matching this route will be processed by ip_error. That function, when invoked by dst_input, will generate the proper ICMP_UNREACHABLE message depending on why the packet cannot be routed, and will drop the packet. Adding the erroneous route to the cache is useful because it can speed up the error processing of further packets sent to the same incorrect address.

ICMP messages are rate limited by ip_error. We already saw in the section "Egress ICMP REDIRECT Rate Limiting" in Chapter 33 that ICMP_REDIRECT messages are also rate limited by the DST. The rate limiting discussed here is independent of the other, but is enforced using the same fields of the dst_entry. This is possible because given any route, these two forms of rate limiting are mutually exclusive: one applies to ICMP_REDIRECT messages and the other one applies to ICMP_UNREACHABLE messages.

Here is how rate limiting is implemented by ip_error with a simple token bucket algorithm.

The timestamp dst.rate_last is updated every time ip_error is invoked to generate an ICMP message. dst.rate_tokens specifies how many ICMP messages—also known as the number of tokens, or the budget—can be sent before the rate limiting kicks in and new ICMP_UNREACHABLE transmission requests will be ignored. The budget is decremented each time an ICMP_UNREACHABLE message is sent, and is incremented by ip_error itself. The budget cannot exceed the maximum number ip_rt_error_burst, which represents, as its name suggests, the maximum number of ICMP messages a host can send in 1 second (i.e., the burst). Its value is expressed in Hz so that it is easy to add tokens based on the difference between the local time jiffies and dst.rate_last.

Figure 35-7. ip_mkroute_input function

When ip_error is invoked and at least one token is available, the function is allowed to transmit an ICMP_UNREACHABLE message. The ICMP subtype is derived from dst.error, which was initialized by ip_route_input_slow when fib_lookup failed to find a route.

Output Routing

Packets generated locally are routed with ip_route_output_slow if __ip_route_output_key, the routine we introduced in the section "Initialization of Function Pointers for Egress Traffic," encounters a cache miss. The structure of ip_route_output_slow somewhat resembles ip_route_input_slow. A high-level overview of the function is shown in Figures 35-8(a) and 35-8(b).

Figure 35-8a. ip_route_output_slow function

Figure 35-8b. ip_route_output_slow function

In the next few sections, we will examine in detail what ip_route_output_slow needs to do to deliver a packet locally or transmit it out. Both local delivery and forwarding have to perform the following tasks, though they may do so in different ways:

  • Select the egress device to use from the route that matches.

  • Select the source IP address to use, based on the scope of the route being searched.

  • Create and initialize a new cache table entry and insert it into the cache.

Figure 35-8 is split into three parts by dotted lines. The top part, a, fills in the fields of the search key that are not already initialized when it is passed to the function. The central part, b, makes a routing table lookup and, if needed, selects the next hop in a multipath route or the default gateway. The bottom part, c, creates the new cache table entry. The bottom part also initializes dst->input and dst->output based on the result of the forwarding decisions taken earlier in the function and tracked by the function mostly through a local flags variable.

In a few cases, a packet can be routed without the need for any routing lookup (i.e., no need to call fib_lookup, the central part of the figure). These are three such cases, all depicted in Figure 35-8:

Packets addressed to a multicast or limited broadcast address, when the egress device is not provided with the search key

This case is a hack that gets around a problem with assumptions made by multimedia tools such as vic and vat. A comment in the function's code explains the problem. See the section "Special Cases" in Chapter 26.

Packets addressed to a local multicast address (i.e., 224.0.0.X) or the limited broadcast address (i.e., 255.255.255.255) going out on a given device

Because the egress device is provided by the caller along with the search key, and because local multicasts and limited broadcasts are addresses with scope RT_SCOPE_LINK, the next hop is represented by the destination address itself. Therefore, the routing subsystem already has all the information needed to route the packet and does not need to do a lookup.[*] See the section "Essential Elements of Routing" in Chapter 30 for a discussion of multicast addresses.

Packets addressed to the unknown address (0.0.0.0).

Those packets are delivered locally. They are not sent out.

Search Key Initialization

This is how ip_route_input_slow initializes the search key that it passes to fib_lookup for the routing table lookup. The same key will be saved along with the new cached route for subsequent lookups using the cache.

    u32 tos    = RT_FL_TOS(oldflp);
    struct flowi fl = { .nl_u = { .ip4_u =
                      { .daddr = oldflp->fl4_dst,
                    .saddr = oldflp->fl4_src,
                    .tos = tos & IPTOS_RT_MASK,
                    .scope = ((tos & RTO_ONLINK) ?
                          RT_SCOPE_LINK :
                          RT_SCOPE_UNIVERSE),
#ifdef CONFIG_IP_ROUTE_FWMARK
                    .fwmark = oldflp->fl4_fwmark
#endif
                      } },
                .iif = loopback_dev.ifindex,
                .oif = oldflp->oif };

The source and destination IP addresses and the firewall mark are just copied from the function's input. The setting of the TOS and scope, however, needs a little explanation:

TOS

The two least significant bits of the fl4_tos field can be used by the caller to store flags that ip_route_output_slow can take into account to determine the scope of the route to search. This is possible because the TOS field itself does not need the whole octet. See the section "Egress lookup" in Chapter 33, and see Figure 18-3 in Chapter 18.

The RT_FL_TOS macro is defined in net/ipv4/route.c as follows:

#define RT_FL_TOS(oldflp) \
((u32)(oldflp->fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK)))
Scope

When the RTO_ONLINK flag is set, the scope of the route to search is set to RT_SCOPE_LINK; otherwise, it is initialized to RT_SCOPE_UNIVERSE. See the section "Egress lookup" in Chapter 33 for an example involving ARP.

Because ip_route_output_slow is called only to route locally generated traffic, the source device in the search key fl is initialized to the loopback device. As we will see, when the destination address is also local, the egress device is also initialized to the loopback device.

Figure 35-8(a) shows how basic fields of the search key are initialized when they are not provided with the input key.

Selecting the Source IP Address

The source IP address used for the search key is also the source IP address put into the IP header of the transmitted packets. In the initial part of ip_route_output_slow, therefore, the function selects the source IP address, if present, from the search key fl.fl4_src; later it initializes rth->rt_src to the same value.

When the search key does not provide a source IP address,[*] the function chooses it by calling inet_select_addr [] with input that depends on the destination address type. In particular, ip_route_output_slow invokes inet_select_addr with the following scopes to handle special cases:

  • RT_SCOPE_HOST when the packet is to be delivered locally (see the section "Local Delivery").

  • RT_SCOPE_LINK when the packet is sent to an address that is meaningful only on the local link, such as broadcasts, limited broadcasts, and local multicasts. This scope is also used when fib_lookup fails but a packet is transmitted anyway, because the search key provides the egress device and the destination is therefore supposed to be on the link (see the section "Transmission to Other Hosts").

When the packet to route does not fall into the two special cases just listed, ip_route_output_slow selects the source IP address by calling FIB_RES_PREFSRC, passing to it the result res of the search made by fib_lookup for a route. FIB_RES_PREFSRC uses various measures to pick the preferred source IP address: it returns a preferred source address if one is explicitly configured for that route by the user; otherwise, it gets one by calling inet_select_addr with the scope of the matching route (res->scope).

ip_route_output_slow gives higher priority to addresses configured on the egress device (if this device is known), by passing it as the first input parameter to inet_select_addr. However, other devices' addresses can be selected as well.

Figure 35-9 summarizes the logic used to select the source IP address.

Figure 35-9. Source IP selection

Local Delivery

A packet is delivered locally when fib_lookup says the destination address is locally configured, or when no destination address is provided (i.e., the search contains the unknown address 0.0.0.0). In this case:

  • The egress device is set to the loopback device. This means that this packet will not leave the local host; the transmission of the packet will reinject it into the IP stack.

  • dst->input is initialized to ip_local_deliver, as described in the section "Local Delivery" in Chapter 20. Thanks to this, when the packet is reinjected and ip_rcv_finish invokes dst_input, the ip_local_deliver function will handle the packet.

Figure 35-10 shows the effect of these two actions as the packet moves from the output functions to the input functions in the kernel network code.

Figure 35-10. Handling packets generated and delivered locally

When neither the source nor the destination IP address is set in the search key, the packet is delivered locally, with both source and destination addresses set to the default loopback address 127.0.0.1 (INADDR_LOOPBACK), which has scope RT_SCOPE_HOST.

Transmission to Other Hosts

Unlike locally delivered packets, those that are to be transmitted out require two further tasks:

  • When the route returned by the lookup is a multipath route, the function needs to select the next hop. This is taken care of by fib_select_multipath.

  • When the returned route is a default route, the function needs to select the default gateway to use. This is taken care of by fib_select_default. (The default route is indicated by a res.prefixlen field of 0; this means that the "prefix length," the length of the netmask associated with the address, is 0.)

Both of these tasks are discussed in the following sections.

Even when a route lookup with fib_lookup fails, it may be possible to successfully transmit a packet. When the egress device is provided with the search key, ip_route_output_slow assumes the destination is directly reachable on the egress device. In this case, a source IP address with scope RT_SCOPE_LINK is also set, if one is not already there; an address from the egress device is used, if possible.

Interaction Between Multipath and Default Gateway Selection

This snapshot from ip_route_output_slow shows when the two key functions, fib_select_multipath and fib_select_default, are called to take care of, respectively, multipath and default gateway selection. res is the result returned by fib_lookup.

#ifdef CONFIG_IP_ROUTE_MULTIPATH
    if (res.fi->fib_nhs > 1 && fl.oif == 0)
        fib_select_multipath(&fl, &res);
    else
#endif
    if (!res.prefixlen && res.type == RTN_UNICAST && !fl.oif)
        fib_select_default(&fl, &res);

Note that there is no need for these two routines when the search key specifies an egress device to use (fl.oif). In this case, res already contains the final forwarding decision. Therefore, the main tasks performed by fib_lookup, and the fib_semantic_match function it calls (see Figure 35-1), are to select:

  • The next hop, when the matching route is a multipath route. fib_semantic_match accomplishes this by selecting the first next hop router that matches the egress device (see the section "Semantic Matching on Subsidiary Criteria"). This is done in conditional code that is present only when the kernel is compiled with multipath support.

  • The default route, when the matching route is a default route. fib_semantic_match accomplishes this by selecting the first default route that matches the egress device. fib_semantic_match does not differentiate between routes with different netmask lengths, which means it does not treat default routes specially, so this case is handled transparently by fib_semantic_match.

Multipath is described in the section "Effects of Multipath on Next Hop Selection"; default gateway selection is described in the section "Default Gateway Selection."

The code snippet shown at the beginning of this section could be misinterpreted when taken out of context, and could lead to two misunderstandings:

  • It suggests that Multipath cannot be used for default routes, because the logic in the snapshot shows that the execution of fib_select_multipath precludes a call to the following fib_select_default function.

    However, Multipath can actually be used on a default route. The ip command provided by the IPROUTE2 package (which is required for configuring Multipath) allows you to configure the default route with multiple next hops. Therefore, calling fib_select_multipath on this route is sufficient to complete the routing decision.

    net-tools's route tool allows an administrator to configure several default routes, each one with a single next hop. In this case, Multipath is not in the picture (fib_nhs is always 1). So fib_select_default is sufficient to complete the routing decision.

  • It suggests that an administrator cannot configure multiple next hops on an egress device, because fib_select_multipath is called only when the egress device is null.

    However, it is possible to configure a multipath route with more than one next hop using the same egress device. A routing lookup whose search key contains a non-null egress device (fl.oif) is handled by fib_semantic_match, which simply returns the first available next hop that matches the device. fib_select_multipath is not involved in the selection.

Default Gateway Selection

The selection of the right default gateway is done with fib_select_default, which is invoked by ip_route_output_slow when both of the following conditions are met:

The route returned by fib_lookup has a /0 netmask (res.prefixlen is 0)

A default route matches any destination address, but it is checked last thanks to having a netmask of /0, the shortest possible netmask. If none of the configured routes matches the destination address, only default routes will match. However, because all default routes would match, fib_lookup always returns the first one it checks. This is why fib_select_default is called to make the best choice among the available ones.

The route returned by fib_lookup is of type RTN_UNICAST

Local routes, broadcast routes, and multicast routes do not need a gateway; using one with them could even be considered nonsensical.

As we mentioned in the section "Double Definitions for Functions" in Chapter 34, there are two versions of fib_select_default. This is the one used when there is no support for Policy Routing (defined in include/net/ip_fib.h [*]):

static inline
void fib_select_default(const struct flowi *flp, struct fib_result *res)
{
    if (FIB_RES_GW(*res) && FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK)
        ip_fib_main_table->tb_select_default(ip_fib_main_table, flp, res);
}

flp is the search key, and res is the lookup result returned by a previous call to fib_lookup in ip_route_output_slow.

Note that when the conditions required to execute tb_select_default are not met, the caller does not receive any error or warning; fib_select_default simply returns the same fib_result instance that was provided as input.

tb_select_default is initialized to fn_hash_select_default, which is defined in net/ipv4/fib_hash.c and described in the following section. Note that fib_select_default does a lookup on the ip_fib_main_table only when res is a route whose next hop gateway has scope RT_SCOPE_LINK; the reason for this is described in the section "Use of the scope" in Chapter 30.

fn_hash_select_default Function

The fn_hash_select_default function receives as input a fib_result structure, res, where the result of a previous fib_lookup invocation was stored. This structure is used as the starting point for the search of the default route by fn_hash_select_default.

To be selected, the default route must have the same scope as res->scope, a priority that is less than or equal to res->fi->fib_priority, and a next hop with scope RT_SCOPE_LINK (i.e., it must be directly connected).

The selection of the route also takes into consideration the reachability status of the next hops. fib_detect_death is used to give higher preference to routes whose next hops have an L3 address that is already resolved to an L2 address (i.e., NUD_REACHABLE state). This check ensures that when the currently used default route becomes unusable—for example, because the next hop gateway failed—a new one is selected, if available.

The previously selected default route is saved in the global variable fn_hash_last_dflt.

The entire routine runs with the fib_hash_lock held.

Effects of Multipath on Next Hop Selection

In both ip_route_input_slow and ip_route_output_slow, fib_select_multipath is called only when:

  • Multipath support is included in the kernel (CONFIG_IP_ROUTE_MULTIPATH).

  • The routing lookup with fib_lookup returns a route with more than one next hop (fib_nhs > 1).

  • The egress interface was not provided with the search key.

  • The destination address is not a local, broadcast, or multicast address.

The following code shows how fib_select_multipath is called to select the next hop:

#ifdef CONFIG_IP_ROUTE_MULTIPATH
    if (res.fi->fib_nhs > 1 && fl.oif == 0)
        fib_select_multipath(&key, &res);
#endif

We already saw in the section "Next Hop Selection" in Chapter 31 how Linux selects the next hop to use when more than one is available. Let's see now how that algorithm is implemented.

We saw in the section "Organization of Routing Hash Tables" in Chapter 34 that a route is represented by the closely coupled data structures fib_node and fib_info, and that each fib_info includes an array of fib_nh data structures (one for each next hop specified in the route).

First, let's clarify which fields of the fib_info and fib_nh structures are used to decide whether a next hop must be chosen among a pool of available next hops, and if so, which one is chosen.

These are the fields used to store the multipath configuration:

fib_info->fib_nhs

Number of next hops defined by the route.

fib_info->fib_nh

Array of fib_nh structures. The size of the array is given by fib_info->fib_nhs.

The following fields are used to implement the weighted random round-robin algorithm:

fib_info->fib_power

This is initialized to the sum of the weights (fib_nh->nh_weight) of all the next hops of this fib_info instance, excluding ones that are disabled for some reason (tagged with the RTNH_F_DEAD flag). Every time fib_select_multipath is called to select a next hop, fib_power is decremented. Its value is reinitialized when it reaches zero.

fib_nh->nh_weight

Weight of this next hop. When not explicitly configured, it is set to a default value of 1. As we will see, this value is used to make fib_select_multipath select the next hops proportional to their weights (relative to fib_info->fib_power).

fib_nh->nh_power

Tokens allowing this next hop to be selected. This value is first initialized to fib_nh->nh_weight when fib_info->fib_power is initialized. Its value is decremented every time this next hop is selected by fib_select_multipath. When the value reaches zero, this next hop is no longer selected until nh_power is reinitialized to fib_nh->nh_weight (which happens when fib_info->fib_power is reinitialized).

Now we'll look at how the implementation works.

Everything starts with all the next hops having a number of tokens (nh_power) that is the same as their weights. This number, as we've seen, is 1 by default. The change_nexthops loop sets the next hops' nh_power field while accumulating the total weights in the function's local variable, power.

    spin_lock_bh(&fib_multipath_lock);
    if (fi->fib_power <= 0) {
        int power = 0;
        change_nexthops(fi) {
            if (!(nh->nh_flags&RTNH_F_DEAD)) {
                power += nh->nh_weight;
                nh->nh_power = nh->nh_weight;
            }
        } endfor_nexthops(fi);

fib_info->fib_power is initialized to the sum of the next hops' weights. Because it is decremented each time fib_select_multipath makes a decision (in code shown later in this section), by the time fib_power reaches 0 each next hop will have been selected a number of times equal to its weight, and therefore proportional to its weight relative to the other next hops.

        fi->fib_power = power;
        if (power <= 0) {
            spin_unlock_bh(&fib_multipath_lock);
            res->nh_sel = 0;
            return;
        }
    }

The selection of a next hop by fib_select_multipath is pseudorandom: every time fib_select_multipath is called, it generates a random number w ranging from zero to fib_info->fib_power-1, and then browses all the next hops until it finds one that has a number of tokens (fib_nh->nh_power) greater than or equal to w. Note that w is reduced at each loop, making each loop more likely to find a next hop that matches this condition.

    w = jiffies % fi->fib_power;
    change_nexthops(fi) {
        if (!(nh->nh_flags&RTNH_F_DEAD) && nh->nh_power) {
            if ((w -= nh->nh_power) <= 0) {
                nh->nh_power--;
                fi->fib_power--;
                res->nh_sel = nhsel;
                spin_unlock_bh(&fib_multipath_lock);
                return;
            }
        }
    } endfor_nexthops(fi);
    res->nh_sel = 0;
    spin_unlock_bh(&fib_multipath_lock);

Multipath Caching

Figure 35-4 shows when the fib_select_multipath routine described in the previous section is called for both ingress and egress traffic, as well as how support for multipath caching influences the way the routing cache is populated by ip_mkroute_input and ip_mkroute_output. Let's analyze the ingress and egress cases separately.

Ingress traffic

When the kernel does not have support for multipath caching, ip_mkroute_input calls fib_select_multipath when the conditions listed in the previous sections are met, and selects one next hop according to the logic described earlier.

When the kernel has support for multipath caching, it does not select one next hop with fib_select_multipath. Instead, it loops over all the next hops of the Multipath route and adds an entry to the cache for each one. For each route, it also calls multipath_set_nhinfo, described in the section "Interface Between the Routing Cache and Multipath" in Chapter 33. That function can be used by the caching algorithm to update the local information it uses to select the next hop. For example, the weighted random algorithm uses the function to populate its database of next hops (see the section "Weighted Random Algorithm" in Chapter 33).

Egress traffic

As shown in Figure 35-4, the egress case is pretty similar to the ingress case. The only difference is that fib_select_multipath is called even when the kernel supports Multipath caching, and the selection it makes is later overridden by ip_mkroute_output.

In both cases, res->nh_sel (the result of the next hop selection) is initialized to the last next hop of the multipath route. For subsequent packets, the selection will be done at cache lookup time. See the section "Multipath Caching" in Chapter 33.

Policy Routing

A routing lookup in a kernel that has support for Policy Routing has to take into account the possible presence of multiple tables. The next two sections show how the Policy Routing versions of fib_lookup and fib_select_default differ from the basic versions we saw earlier in this chapter.

fib_lookup with Policy Routing

When Policy Routing is configured, this function contains an extra step: it needs to find out what routing table to use based on the configured policies.

We saw in the section "Main Data Structures" in Chapter 32 that routing policies are defined with fib_rule data structures. All the fib_rule instances are linked together in the global list fib_rules. The list is kept sorted in increasing order of the priority field. This allows the configuration to define the order in which the rules are checked, thereby reducing lookup time: the more commonly matched or most important rules (as defined by the administrator, depending on the context) are placed closer to the head of the list. priority is a 32-bit field, which means a host can theoretically have up to 2^32 policies. Of course, because policies are stored in a sorted, flat list, a high number of policies can decrease routing performance significantly.

Without any user configuration, fib_rules includes the three default instances defined in net/ipv4/fib_rules.c, as shown in Figure 35-11:

local_rule

This is the highest-priority rule and is therefore at the head of the list. It always matches, and its purpose is to force the first lookup to be on the ip_fib_local_table routing table. This makes sense because the packets addressed to the local system don't need any further routing decision.

main_rule

This is the second table to be checked (unless the administrator inserts some user-defined tables in between) and always matches as well. It causes a search on the main routing table ip_fib_main_table.

default_rule

This is the default table and is put at the end of the list.

Figure 35-11. Default rules

Figure 35-12 shows the logic implemented by fib_lookup. It browses policies one by one until it either finds a match with the packet it is routing or gets to the end of the list of policies without any match. When a matching policy is found, the action that follows depends on the policy type (see the section "Lookup with Policy Routing" in Chapter 31). In particular, the policy actions RTN_UNREACHABLE, RTN_BLACKHOLE, and RTN_PROHIBIT lead to the return of an error, whose value may be used by the caller of fib_lookup to generate the appropriate ICMP message. The policy action RTN_UNICAST leads to a lookup with tb_lookup, which consists of a call to the fn_hash_lookup function described in the section "The Table Lookup: fn_hash_lookup." This function can return various results. Besides the errors already described in its dedicated section, it is interesting to note that:

  • When the lookup succeeds, res->r is initialized to the matching policy.

  • When the lookup fails, fib_lookup continues its loop over the policies if the error type is -EAGAIN. This is because that error is returned when the action type associated with the matching route found by fn_hash_lookup is RTN_THROW (see the section "Route Types and Actions" in Chapter 30).

Default Gateway Selection with Policy Routing

The selection of a default route with Policy Routing works just the same as when there is no Policy Routing. The only difference is that fib_select_default, defined in net/ipv4/fib_rules.c, uses the matching policy (res->r) to identify the routing table to use.

void fib_select_default(const struct flowi *flp, struct fib_result *res)
{
    if (res->r && res->r->r_action == RTN_UNICAST &&
        FIB_RES_GW(*res) && FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK) {
        struct fib_table *tb;
        if ((tb = fib_get_table(res->r->r_table)) != NULL)
            tb->tb_select_default(tb, flp, res);
    }
}

Source Routing

We saw in Chapter 18 that IP packets can be source routed. Because this is taken care of by the IP protocol directly without involving the routing subsystem, it is covered in the part of the book about IP. Here we are interested just in the implications of source routing on the routing lookups.

Let's use Figure 18-1 in Chapter 18 as a reference. When an ingress packet reaches ip_rcv_finish, it triggers the first routing lookup. In the absence of source routing, this is the only routing lookup needed. However, before ip_rcv_finish calls dst_input, it checks whether the IP header specifies source routing and, if so, takes care of it.

Source routing here is handled by ip_options_rcv_srr. It extracts the next hop to use from the IP header and makes a second routing lookup with ip_route_input. This second lookup replaces the existing skb->dst with a newer one. See the sequence of calls in Figure 35-13.

When locally generated traffic carries the Source Routing IP option, it triggers only one routing lookup because the correct next hop is selected before the lookup (see ip_queue_xmit for an example).

Figure 35-12. Policy Routing version of fib_lookup function

Figure 35-13. Source routing for ingress traffic

Policy Routing and Routing Table Based Classifier

We saw in the section "Routing Table Based Classifier" in Chapter 31 that the Traffic Control subsystem, which implements the network QoS layer, can classify packets based on a tag computed by the routing subsystem. In the same section, we saw how realms are configured, and the logic used to derive the routing tag from those configurations. In this section, we will see how the realm configuration is stored in the routing table and how the routing tag is computed by the routing code. Because Traffic Control is outside the scope of this book, we will not cover how it uses the routing tag.

Storing the Realms

The kernel stores the policy and route realms in the fib_rule->r_tclassid and fib_nh->nh_tclassid fields, respectively.

fib_rule->r_tclassid

Both the source and destination realms are 8-bit values (ranging from 0 to 255), but each is assigned 16 bits within r_tclassid. When the source realm is configured, it goes into the upper 16 bits; when the destination realm is configured, it goes into the lower 16 bits. See Figure 35-14.

Figure 35-14. r_tclassid field structure

fib_nh->nh_tclassid

Normally, only the destination realm is used to compute the routing tag; the matching route is selected based on the destination address. However, as we saw in the section "Computing the routing tag" in Chapter 31, sometimes the kernel needs to make a reverse path lookup. When that happens, the destination realm of a route is derived from the source realm of the reverse route. nh_tclassid is a 32-bit variable.

Helper Routines

Before seeing how dst.tclassid is initialized, let's look at a few helper routines that are used in accomplishing that task:

fib_rules_tclass

This is used to retrieve the r_tclassid field from a fib_rule data structure. Because the result returned by fib_lookup includes a pointer to the fib_rule instance that matched, fib_rules_tclass is useful to extract the matching rule after a lookup. Note that this function is called only when there is support for Policy Routing in the kernel, which makes fib_rule structures meaningful.

fib_combine_itag

Figure 35-15 shows the logic of this function, which is used to help find realms when a reverse path lookup is necessary.

When Policy Routing is not enabled, it simply swaps the source and destination route realms.

When Policy Routing is enabled, the function takes the policy source realm (S2 in Figure 35-15) as the destination realm. As the source realm, it takes the destination route realm (D1) if one is provided, and the destination policy realm (D2) otherwise.

The result is returned in the input parameter itag, which will be used by the caller when invoking rt_set_nexthop (see the section "Computing the Routing Tag").

This function is called by fib_validate_source after a reverse path lookup. fib_validate_source receives the source and destination IP addresses as input, swaps them, and calls fib_lookup to do a reverse path lookup. The result returned by fib_lookup, therefore, also has the source and destination realms swapped. Because the realm fields are 16 bits wide and the realms returned by fib_lookup are swapped, fib_combine_itag uses 16-bit shifts to adjust everything.

set_class_tag

Given a route (and therefore the associated dst_entry.tclassid) and a tag previously initialized by the caller, set_class_tag uses the second parameter to fill in the realms not already initialized in dst_entry.tclassid.

Computing the Routing Tag

Figure 35-15. fib_combine_itag function

The routing tag has to be calculated by the ip_route_input_slow and ip_route_output_slow functions we saw earlier in this chapter. The logic used was described in the section "Computing the Routing Tag" in Chapter 30.

The information required to compute a routing tag is the skb packet to route and the skb->dst result of the routing lookup. The routing tag is saved in skb->dst.tclassid. Once ip_route_input_slow and ip_route_output_slow have successfully found the forwarding information, they initialize a new routing cache entry, including the routing tag, and add it to the cache. Part of the cache entry initialization is done with rt_set_nexthop, which among other things takes care of the routing tag dst_entry.tclassid. Figure 35-4 shows exactly when rt_set_nexthop is called.

static void rt_set_nexthop(struct rtable *rt, struct fib_result *res, u32 itag)
{
    struct fib_info *fi = res->fi;
 
    if (fi) {
        ... ... ...
#ifdef CONFIG_NET_CLS_ROUTE
        rt->u.dst.tclassid = FIB_RES_NH(*res).nh_tclassid;
#endif
    }
    ... ... ...
 
#ifdef CONFIG_NET_CLS_ROUTE
#ifdef CONFIG_IP_MULTIPLE_TABLES
    set_class_tag(rt, fib_rules_tclass(res));
#endif
    set_class_tag(rt, itag);
#endif
    ... ... ...
}

The preceding snippet shows that tclassid is first initialized with the destination route's realm when the kernel has support for the routing table based classifier (otherwise, there would be no need for that). Note that set_class_tag is called with different inputs based on whether the kernel has Policy Routing support:

With Policy Routing support

The components of dst.tclassid that are not yet initialized are filled in from the policy realms.

Without Policy Routing support

The components of dst.tclassid that are not yet initialized are filled in using the input parameter itag previously computed by the caller:

  • ip_route_input_slow (called via __mkroute_input) passes a value of itag computed with fib_combine_itag.

  • ip_route_output_slow (called via __mkroute_output) passes 0, because the packets it routes are generated locally and therefore the kernel does not use any reverse lookup to try to fill in the missing realms.




[*] Note that since they are inline routines, they can use goto statements that refer to labels defined in the slow routines they are part of.

[*] The preferred source IP address to use for traffic generated locally (i.e., packets whose transmission is not triggered or influenced by the reception of another packet) may be different. See the section "Selecting the Source IP Address."

[†] RFC 1122 calls it the "specific destination."

[*] There is an obsolete form of limited broadcast that consists of all zeros: 0.0.0.0.

[*] Note that the L3-to-L2 address mapping is also automatic, as explained in the section "Special Cases" in Chapter 26.

[†] 0.0.0.0 is also an obsolete form of a limited broadcast address, but Linux does not honor that form.

[*] See the section "Preferred Source Address Selection" for examples of when the source IP address may be provided with the search key.

[†] I introduced this routine in the section "Helper Routines."

[*] See the section "Default Gateway Selection with Policy Routing" for the other definition.

Chapter 36. Routing: Miscellaneous Topics

In the previous chapters, we saw how the various routing features work and how they interact with each other and with other kernel subsystems. In this chapter, we conclude the routing part of the book with a description of how the subsystem interacts with the user-space commands that configure routing. I will not describe the commands themselves, because administration is outside the scope of this book. We will also look at the various files exported in the /proc directory that can be used to tune routing. The chapter concludes with a detailed description of the data structures already introduced in Chapter 32.

User-Space Configuration Tools

Routing can be configured with both the net-tools and IPROUTE2 packages, which use the ioctl and Netlink interfaces, respectively. The following subsections give more details on these two packages, but focus on IPROUTE2's ip command, which is the newer and more powerful way to configure routing on Linux.

The two sets of tools can coexist without problems, if you know their limitations and use them accordingly. net-tools does not allow you to configure any of the advanced routing features, such as Multipath and Policy Routing; nor can you see these features in the results displayed by net-tools' utilities. However, the routing configuration applied by IPROUTE2 is backward compatible with net-tools.

Figure 36-1 summarizes what we will see in the subsections. The figure shows the main functions used by the two kernel interfaces to manipulate routing tables, and the ioctl commands used by net-tools. (IPROUTE2 allows you to configure other objects too, such as policy rules, but these are not shown in the figure to keep it simple.)

A few points are worth mentioning:

  • Both tools end up adding or removing routes using the same routines: fn_hash_insert and fn_hash_delete, which we saw in Chapter 34.

    Figure 36-1. ioctl- based versus Netlink-based routing table manipulation

  • Because of the previous point, the input received from the two user-space tools must be saved in the same data structures before invoking the common fn_hash_ xxx routines. Because the two tools use different message types to talk to the kernel, and because Netlink is the newer and preferred interface, the input from ioctl commands is converted to Netlink format with fib_convert_rtentry. The conversion also takes care of parsing the request—converting the string entered by the user into the kernel data structure shown later in this chapter—so there is no need for an explicit call to the parsing routine inet_check_attr (which is instead called by the inet_rtm_ xxx routines).

  • A lock is used to serialize routing configuration changes. Figure 36-1 does not show any locking associated with the two inet_rtm_ xxx routines, because the lock is acquired by the routing Netlink socket code before invoking them (see rtnetlink_rcv for details).

Configuring Routing with IPROUTE2

The IPROUTE2 package comes with different tools. In this chapter, we are interested in the ip command, and in particular in its two subcommands ip route and ip rule, used respectively to manipulate routes and policy routing rules.

IPROUTE2 allows you not only to add and remove a route, but also to modify, append, and prepend routes. These do not represent extra features, but just management operations that can make life easier when dealing with big routing tables.

Correspondence between IPROUTE2 user commands and kernel functions

Tables 36-1 and 36-2 show the operation codes and flags set by IPROUTE2 for the main ip route and ip rule commands. Knowing these will make it easier to browse the routines shown in Figure 36-1 and listed in the "Kernel handler" columns of the tables. The "CLI keyword" column contains the word in the command line that triggers the proper operation.

One keyword in Table 36-1 requires an explanation: flush. The ip route flush command allows the administrator to define what kinds of routes to remove. Usually one would flush everything, but the command allows the administrator to restrict the routes flushed through criteria such as device and destination network.

The kernel does not have a handler for the flush operation. Instead, IPROUTE2 issues a list command to get a copy of the routing table, filters out routes that do not match the flush criteria, and then issues an RTM_DELROUTE command for each route left. This works fine in small setups, but can introduce significant overhead when dealing with big routing tables. It would have been easier and faster to send the kernel the flushing criteria and let it take care of the filtering.

Table 36-1. Parameters set by do_iproute in IPROUTE2's iproute.c file

CLI keyword      Operation       Flags                          Kernel handler
add              RTM_NEWROUTE    NLM_F_EXCL, NLM_F_CREATE       inet_rtm_newroute
change           RTM_NEWROUTE    NLM_F_REPLACE                  inet_rtm_newroute
replace          RTM_NEWROUTE    NLM_F_CREATE, NLM_F_REPLACE    inet_rtm_newroute
prepend          RTM_NEWROUTE    NLM_F_CREATE                   inet_rtm_newroute
append           RTM_NEWROUTE    NLM_F_CREATE, NLM_F_APPEND     inet_rtm_newroute
test             RTM_NEWROUTE    NLM_F_EXCL                     inet_rtm_newroute
delete           RTM_DELROUTE    None                           inet_rtm_delroute
list/lst/show    RTM_GETROUTE    None                           inet_dump_fib
get              RTM_GETROUTE    NLM_F_REQUEST                  inet_rtm_getroute
flush            RTM_GETROUTE    None                           None

Table 36-2. Parameters set by do_iprule in IPROUTE2's iprule.c file

CLI keyword      Operation      Flag    Kernel handler
add              RTM_NEWRULE    None    inet_rtm_newrule
delete           RTM_DELRULE    None    inet_rtm_delrule
list/lst/show    RTM_GETRULE    None    inet_dump_rules

Note that some kernel handlers take care of more than one user command type from the "CLI keyword" column. The kernel can distinguish the different commands thanks to the combination of the operation and flags parameters.

As shown in Figure 36-1, the kernel handlers that manipulate routes are inet_rtm_newroute and inet_rtm_delroute, described in the next subsection. I'll leave it as an exercise to see how the handlers in Table 36-2 are implemented (use the section "Policy Routing" in Chapter 35 as a reference).

For readers who are curious about investigating the IPROUTE2 utility code itself, Figure 36-2 shows the files and routines in this package that take care of parsing and sending the requests of the various ip route and ip rule commands to the kernel. For example, if you type the command ip route add ..., the routine main in ip.c would process the command with do_iproute defined in iproute.c. Because the operation is add, do_iproute would process the command with iproute_modify.

Figure 36-2. IPROUTE2 files and functions for routing

inet_rtm_newroute and inet_rtm_delroute functions

These two routines take care of adding and removing a route, respectively, when the kernel receives a user request from the IPROUTE2 tools, as shown in Figure 36-1 and Table 36-1.

Both routines use inet_check_attr to fill in a kern_rta structure, which stores the results from parsing the input from the user command. All the fields of kern_rta are pointers: they point directly to fields inside the data structure received from user space. A NULL pointer means that the associated field has not been configured.

In this section, we'll examine inet_rtm_newroute. The operation of inet_rtm_delroute is symmetrical.

int inet_rtm_newroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
{
    struct fib_table * tb;
    struct rtattr **rta = arg;
    struct rtmsg *r = NLMSG_DATA(nlh);
 
    if (inet_check_attr(r, rta))
        return -EINVAL;
 
    tb = fib_new_table(r->rtm_table);
 
    if (tb)
        return tb->tb_insert(tb, r, (struct kern_rta*)rta, nlh, &NETLINK_CB(skb));
    return -ENOBUFS;
}

First the function parses the input message nlh with inet_check_attr and stores the result in rta. When adding a route, the user can specify which routing table it should go in. The concept of multiple routing tables is described in greater detail in the section "Concepts Behind Policy Routing" in Chapter 31. If the specified table does not already exist, it is created and initialized with fib_new_table. Having a reference to the routing table, the function calls the virtual function tb_insert to do the insertion. We saw in the section "Adding and Removing Routes" in Chapter 34 that tb_insert invokes fn_hash_insert, whose internals are described in the section "Adding a Route" in the same chapter.

Configuring Routing with net-tools

The route command in the net-tools package is available in most Unix systems, and is the most common way to configure and dump the content of the routing table and its cache.

The route add and route del commands send the ioctl commands SIOCADDRT and SIOCDELRT, respectively, to the kernel to add and remove a route. The dump of the routing table and routing cache, however, is done in a different way: route simply dumps the contents of the /proc/net/route and /proc/net/rt_cache files.[*]

The kernel handler that takes care of the two ioctl commands is ip_rt_ioctl, defined in net/ipv4/fib_frontend.c. Figure 36-1 showed part of its internals.

Only users with network administration privileges (CAP_NET_ADMIN) can use the route command. The call to capable is used to enforce this rule.[*] Then, because the data structure that carries the information about the route to delete or add is in user space, it has to be copied into an address in kernel space with copy_from_user.

        ... ... ...
        if (!capable(CAP_NET_ADMIN))
            return -EPERM;
        if (copy_from_user(&r, arg, sizeof(struct rtentry)))
            return -EFAULT;
           ... ... ...

Change Notifications

We saw in Chapter 3 that Netlink defines multicast groups for the purpose of sending out notifications about particular kinds of events, and user programs can register to be part of those groups. Among those groups is the RTMGRP_IPV4_ROUTE group, which is used for notifications regarding changes to the IPv4 routing tables. These changes are sent to the RTMGRP_IPV4_ROUTE multicast group with the rtmsg_fib routine.

Examples of interested listeners for these events are routing daemons, which need to know such things as when routes are added or deleted by other daemons or by manual user configuration. Users can also use IPROUTE2's ip monitor route command to test the feature. Figure 36-3 shows an example: every time a change is applied to a routing table on one terminal, a notification is printed on the other terminal where the ip monitor route command is executing.[*] The terminal and the kernel communicate via a Netlink socket.

Figure 36-3. Example of use of the ip monitor route command

Routes Inserted by the Kernel: The fib_magic Function

We saw in Figure 36-1 that the Netlink socket can be used to exchange messages between the kernel and user space. There are cases, however, where different parts of the kernel use Netlink messages to communicate with each other. This makes it easy, for instance, to react to kernel-generated events with the same code that is normally used to react to user-generated events.

For instance, we saw in the section "Adding an IP address" in Chapter 32 that when a new address is configured on an interface, a set of routing entries may be generated. An easy way to install those routes is to simulate the reception of a user-space command that requests the insertion of new routes. This is accomplished with the fib_magic routine, which creates the same message that would have been generated if the route was entered explicitly with the route add or ip route add command.

fib_add_ifaddr and fib_del_ifaddr are two good examples of the use of fib_magic. See the section "Changes in IP Configuration" in Chapter 32 for more details on those two functions.

Statistics

The routing code keeps statistics about different aspects of the routing code, such as lookups and garbage collection. Statistics are maintained on a per-processor basis. ip_rt_init, described in the section "Routing Subsystem Initialization" in Chapter 32, allocates for each CPU a copy of the rt_cache_stat data structure, where the CPU keeps its own statistics. The rt_cache_stat fields are incremented with the RT_CACHE_STAT_INC macro, which transparently updates the counter for the right CPU. The section "rt_cache_stat Structure" describes the fields of rt_cache_stat in detail.

The content of these statistics can be read by dumping the content of the /proc/net/stat/rt_cache file (see the section "The /proc/net and /proc/net/stat Directories"). The output you would get, however, is not formatted for easy reading. To get formatted output, you can use the lnstat tool that comes with the IPROUTE2 package.

Tuning via /proc Filesystem

The IPv4 routing subsystem uses the /proc filesystem to export some internal data structures in read-only mode (e.g., the cache), and other structures in read-write mode so that they can be used for tuning.

Figure 36-4 shows where these files are located and the routines that register them. The files shown without a reference to a creating routine are statically defined by sysctl_init at boot time.

/proc/sys/net/ipv4/

/proc/sys/net/ipv4/conf

/proc/sys/net/ipv4/route

These directories are used to export internal data structures used for tuning. The files in these directories are therefore writable. Later sections list their files, the associated kernel variables, and the variables' default values when applicable.[*]

/proc/net/

/proc/net/stat

Files in these directories are not used for tuning, but rather, to execute kernel routines to get some kind of information. See the section "The /proc/net and /proc/net/stat Directories."

Figure 36-4. /proc files used by the IPv4 routing subsystem

The /proc/sys/net/ipv4 Directory

This directory contains a lot of files, but the only ones used by the routing subsystem are:

ip_forward

Contains a Boolean flag that can be used to globally enable and disable IP forwarding. Its value can be overwritten on a per-device basis (see the section "Enabling and Disabling Forwarding").

icmp_echo_ignore_broadcasts

An ICMP tuning parameter. It was introduced in the section "Directed Broadcasts" in Chapter 30, which explained that the routing code uses it to decide how to handle directed broadcasts. Broadcast filtering can be enabled and disabled only here, and only globally (not on a per-device basis).

See Table 36-3 for a summary of these files.

Table 36-3. /proc/sys/net/ipv4/ files usable for tuning the routing subsystem

Kernel variable name                    Filename                       Default value
ipv4_devconf.forwarding[*]              ip_forward                     0
sysctl_icmp_echo_ignore_broadcasts      icmp_echo_ignore_broadcasts    0

[*] See Chapter 19 for a description of the ipv4_devconf data structure.

The /proc/sys/net/ipv4/route Directory

The IPv4 routing subsystem uses all the files in this directory. Here is a description of the files, grouped by functionality:

error_burst

error_cost

Used to implement rate limiting for ICMP_UNREACHABLE messages. See the section "Routing Failure" in Chapter 35.

max_size

gc_thresh

gc_min_interval

gc_timeout

gc_elasticity

gc_interval

Used by the routing cache garbage collection algorithm, described in Chapter 33.

flush

min_delay

max_delay

Used to control the flushing of the routing cache.

Unlike the other files in this directory, flush is write-only[*] and triggers an action; it is not a simple tuning parameter. When the user writes n into this file, the function ipv4_sysctl_rtcache_flush is invoked to schedule a flush of the routing table cache after n seconds. When a negative value is written to flush, the kernel schedules a flush after the default delay min_delay. max_delay is the maximum time that can pass between when the user schedules a flush and when the kernel actually flushes the cache. See the section "Flushing the Routing Cache" in Chapter 33.

min_adv_mss

This value is associated with the TCP Maximum Segment Size (MSS) parameter. Each route has an associated MSS value. When the next hop of a dst_entry is initialized (with rt_set_nexthop), before it is added to the routing table cache with rt_intern_hash, the MSS is initialized to either the outgoing device's MTU or min_adv_mss, whichever is greater. See the comment in tcp_advertise_mss and its initialization in rt_set_nexthop.

min_pmtu

mtu_expires

When the PMTU associated with a routing cache entry is changed, the routing cache is scheduled to expire after mtu_expires seconds. See the section "Examples of events that can expire cache entries" in Chapter 30.

min_pmtu is the minimum PMTU value that the path MTU discovery protocol can set for a route.

redirect_load

redirect_number

redirect_silence

Used to implement rate limiting for ICMP_REDIRECT messages. See the section "Egress ICMP REDIRECT Rate Limiting" in Chapter 33.

secret_interval

The routing cache is flushed regularly every secret_interval/HZ seconds. See the section "Flushing the Routing Cache" in Chapter 33.

Table 36-4 lists the kernel variables and default values.[*]

Table 36-4. /proc/sys/net/ipv4/route files usable for tuning the routing subsystem

Kernel variable name      Filename                 Default value
ip_rt_error_burst         error_burst              5 * HZ
ip_rt_error_cost          error_cost               HZ
flush_delay               flush                    N/A[a]
ip_rt_gc_elasticity       gc_elasticity            8
ip_rt_gc_interval         gc_interval              60 * HZ
ip_rt_gc_min_interval     gc_min_interval_ms[b]    HZ / 2
ipv4_dst_ops.gc_thresh    gc_thresh                Depends on RAM[c]
ip_rt_gc_timeout          gc_timeout               RT_GC_TIMEOUT (300 * HZ)
ip_rt_min_delay           min_delay                2 * HZ
ip_rt_max_delay           max_delay                10 * HZ
ip_rt_max_size            max_size                 Depends on RAM
ip_rt_min_advmss          min_adv_mss              256
ip_rt_min_pmtu            min_pmtu                 512+20+20[d]
ip_rt_mtu_expires         mtu_expires              10 * 60 * HZ
ip_rt_redirect_load       redirect_load            HZ / 50
ip_rt_redirect_number     redirect_number          9
ip_rt_redirect_silence    redirect_silence         ((HZ/50)<<(9+1))
ip_rt_secret_interval     secret_interval          10 * 60 * HZ

[a] See the description of flush earlier in this section.
[b] There is another file, gc_min_interval, associated with the same kernel variable. That file is deprecated and will be removed.
[c] Initialized at boot time based on the hash table size, whose value depends on the amount of RAM installed.
[d] 512 is the default MSS used by TCP according to RFCs 793 and 1112; the additional 20+20 is the size of the IP and TCP headers, when they are both without options.

The /proc/sys/net/ipv4/conf Directory

This directory includes files that can be used to tune the IPv4, IPsec, and ARP protocols, as well as to control routing on a per-device basis. The protocol-related parameters are covered in the associated chapters, so in this chapter, we will cover only the ones used to tune routing.

The /proc/sys/net/ipv4/conf directory includes subdirectories for each registered network device, including the loopback device, which in turn contain files for each tuning parameter. This allows you to configure the routing parameters on a per-device basis for the previously mentioned protocols. Each directory contains the same set of files. All the parameters are grouped by the kernel in a data structure of type ipv4_devconf, defined in include/linux/inetdevice.h and shown in Table 36-5. The default value is the value exported to the corresponding file in the /proc/sys/net/ipv4/conf/default directory.

Table 36-5. /proc/sys/net/ipv4/conf subdirectory files usable for tuning the routing subsystem

Kernel variable name (field of ipv4_devconf)    Filename               Default value
accept_redirects                                accept_redirects       1
accept_source_route                             accept_source_route    1
forwarding                                      forwarding             0
mc_forwarding                                   mc_forwarding          0
rp_filter                                       rp_filter              0
secure_redirects                                secure_redirects       1
shared_media                                    shared_media           1
send_redirects                                  send_redirects         1
log_martians                                    log_martians           0
tag (not used)                                  tag                    0

Special subdirectories

In addition to a directory for every device, the /proc/sys/net/ipv4/conf directory includes two special directories:

default

All the parameters not explicitly configured by the user are initialized to the default values exported in this directory. These values are maintained by the kernel in a separate ipv4_devconf instance, ipv4_devconf_dflt (see Table 36-5).

all

This directory is used for global configurations (i.e., what the user writes here applies to all devices). These values are also maintained by the kernel in a separate data structure whose name is the same as the structure type itself, ipv4_devconf.

Both the default and per-device directories are created by calling devinet_sysctl_register. The all directory is statically defined (see the definition of devinet_sysctl_table in net/ipv4/devinet.c).

devinet_sysctl_register is called by devinet_init when the routing code gets initialized at boot time (see the section "Routing Subsystem Initialization" in Chapter 32) to register the default directory. Because the function is also called by inetdev_init, it is called once for each device (when the first IPv4 address is configured on the device).

Use of the special subdirectories

Different features behave differently when combining the per-device and global configuration values, as well as when propagating the changes to the variables exported in the all directory. For example:

  • For some of the fields, the per-device and global values are ANDed. In this case, the feature is enabled only if both the global and per-device configurations are enabled.

  • For some of the fields, the values are ORed. In this case, enabling the value in either file is sufficient.

  • For some of the fields, the global values are not taken into consideration.

How the files are consulted for a given feature depends on what makes sense for that feature.

For each parameter, there is a macro, IN_DEV_XXX, defined in include/linux/inetdevice.h, that can be used to derive the current operative state for a given device. The macros take as their input parameter the IPv4 configuration block of the device, which is an instance of in_device. You can look at those macros to figure out what criteria (AND, OR, or NONE) each parameter uses to combine the per-device and global configuration. Here is an example for each of the three cases:

#define IN_DEV_RPFILTER(in_dev) \
(ipv4_devconf.rp_filter && (in_dev)->cnf.rp_filter)
 
#define IN_DEV_PROXY_ARP(in_dev) \
(ipv4_devconf.proxy_arp || (in_dev)->cnf.proxy_arp)
 
#define IN_DEV_MEDIUM_ID(in_dev) ((in_dev)->cnf.medium_id)

The logic used by the preceding examples is not the only one implemented by the IN_DEV_XXX macros. For example, IN_DEV_RX_REDIRECTS is more complex and is defined as a wrapper around several parameters, not just as an AND or OR condition between two values.

There is one more case to consider. For some parameters, changes to the files in the all directory are propagated to the per-device directories right away (instead of being consulted by the IN_DEV_XXX macros). In that case, the associated IN_DEV_XXX macro does not need to check the global value. See the section "Enabling and Disabling Forwarding" for an example.

File descriptions

Here is a brief description of the files listed in Table 36-5:

accept_redirects

send_redirects

ICMP redirects, described in Chapter 31, are sent by routers to hosts to inform them about suboptimal routing. accept_redirects is a Boolean flag that can be used to enable or disable ICMP redirect processing for an interface.[*] send_redirects is used for the other side of the coin: when it is true, the system is allowed to generate ICMP redirects when the required conditions of suboptimal routing are detected.

accept_source_route

The IP Source Routing option can be enabled and disabled with this flag. When it is disabled, ip_rcv_finish drops all the IP packets carrying such an option. IP options are discussed in Chapter 18.

forwarding

mc_forwarding

These are Boolean flags used to enable and disable unicast and multicast forwarding, respectively. For mc_forwarding to be used, the kernel must be compiled with the necessary multicast options.

rp_filter

When this flag is true, an ingress packet is dropped if the source of the packet is reachable through an asymmetric route (according to the routing table of the local host). See the section "Reverse Path Filtering" in Chapter 31.

secure_redirects

shared_media

When secure_redirects is set, ICMP_REDIRECT messages are accepted only when the suggested gateway is already known locally as a gateway.

Normally, ICMP_REDIRECT messages that suggest the use of a new next hop whose IP address is not in the same subnet as the current next hop are rejected, as specified in RFC 1122. However, there are cases where accepting them would make sense. When shared_media is true, those ICMP_REDIRECT messages will be accepted. RFC 1620 explains quite nicely why this option makes sense in some cases.

See the section "ICMP_REDIRECT Messages" in Chapter 31 for more information on this feature.

log_martians

When this flag is set, the kernel generates log messages when it receives packets with illegal IP addresses. See the section "Verbose Monitoring" in Chapter 31.

The /proc/net and /proc/net/stat Directories

The /proc/net directory offers a few files that execute kernel handlers when you try to dump their contents. The following is the file-by-file description:

route

rt_cache

You can read those two files to get a dump of the routing table (ip_fib_main_table) and the routing cache, respectively. They do not display the contents of user-defined routing tables, which can be created when the kernel has support for Policy Routing. IP addresses are printed in hexadecimal format.

stat/rt_cache

Collection of statistics. See the sections "Statistics" and "rt_cache_stat Structure."

rt_acct

Accounting information collected by the routing table based classifier introduced in Chapter 31. For better-formatted output, use IPROUTE2's rtacct command.

ip_mr_cache

ip_mr_vif

Used by multicast routing (not covered in this book).

Table 36-6 summarizes the association between files and kernel handlers.

Table 36-6. Kernel handlers for the files in /proc/net used by the routing subsystem

Filename         Kernel file where it is defined
route            net/ipv4/fib_hash.c (fib_proc_init)
rt_cache         net/ipv4/route.c (ip_rt_init)
rt_acct          net/ipv4/route.c (ip_rt_init)
ip_mr_cache      net/ipv4/ipmr.c (ip_mr_init)
ip_mr_vif        net/ipv4/ipmr.c (ip_mr_init)
stat/rt_cache    net/ipv4/route.c (ip_rt_init)

As shown in Figure 36-4, the files in the two directories /proc/net and /proc/net/stat are created indirectly by inet_init, with the help of routines such as ipv4_proc_init and ip_init. inet_init is marked with the module_init macro and therefore is executed at boot time (see Chapter 7).

Enabling and Disabling Forwarding

As mentioned earlier in this chapter, the kernel exports parameters via /proc that can be used to enable and disable IP forwarding, both globally and on a per-device basis. In this chapter, we will address only IPv4 forwarding.

Even though an administrator can change the forwarding state globally, there really is no global forwarding state. The routing code uses only the per-device forwarding states: global configuration changes are just a convenient way to apply the same change to all devices in one shot. In particular, when the kernel receives an IP packet whose destination address does not belong to the local system, it either forwards the packet or drops it based on the forwarding state of the receiving interface. This is not a decision made on a global basis or on the forwarding state of the device that would be used to transmit the packet out toward its destination.

It is important to understand the relationship between per-device and global configurations, to know how the system is going to behave when you change their values. Here are the relevant /proc files:

/proc/sys/net/ipv4/conf/device_name/forwarding

Enable and disable forwarding on the device device_name. A value of zero means disabled; any other value means enabled.

/proc/sys/net/ipv4/conf/all/forwarding

Changes to this file are applied to all network devices (including the ones not in the UP state) but do not affect the forwarding state of devices registered in the future.

/proc/sys/net/ipv4/conf/default/forwarding

This is the default forwarding state of those devices that do not have an explicit configuration. Unlike the previous file, its value affects only the forwarding state of those devices registered in the future (not the ones already present).

/proc/sys/net/ipv4/ip_forward

Changes to this file have the same effect as changes to /proc/sys/net/ipv4/conf/all/forwarding. You can look at the former as an alias to the latter.

Changes to the forwarding files are processed by devinet_sysctl_forward, which distinguishes between the three cases internally. Changes to the ip_forward file are processed by ipv4_sysctl_forward. Every time there is a change of forwarding state for at least one device, the routing cache is flushed with rt_cache_flush.

/proc/sys/net/ipv4/conf/all/forwarding/proc/sys/net/ipv4/ip_forward 的更改将触发 的执行inet_forward_change,其中:

Changes to either /proc/sys/net/ipv4/conf/all/forwarding or /proc/sys/net/ipv4/ip_forward will trigger the execution of inet_forward_change, which:

  1. Updates the ipv4_devconf.accept_redirects configuration parameter.

     This is done to enforce the rule by which only hosts are supposed to accept ICMP redirects, not routers. If global forwarding gets enabled, it means the system is now to be considered a router and therefore the default configuration for honoring ICMP redirects must be disabled. (The administrator can, of course, re-enable it if needed.)

  2. Updates the default forwarding state.

     Note that changing the global forwarding configuration forces the default to change, but not vice versa.

  3. Updates the forwarding state of all devices.

Data Structures Featured in This Part of the Book

The section "Main Data Structures" in Chapter 32 gave a brief overview of the main data structures, and Figure 34-1 in Chapter 34 can help you understand the relationships between them. This section provides a detailed description of each data structure type. Figure 36-5 shows the file that defines each data structure.

Figure 36-5. Distribution of data structures in kernel files

fib_table Structure

A fib_table structure is created for each routing table instance. The structure consists mainly of a routing table identifier and a set of function pointers used to manage the table:

unsigned char tb_id

Routing table identifier. In include/linux/rtnetlink.h, you can find the rt_class_t list of predefined values, such as RT_TABLE_LOCAL.

unsigned tb_stamp

Not used.


The function called by the fib_lookup routine described in Chapter 35.

int (*tb_insert)(struct fib_table *table, struct rtmsg *r,struct kern_rta *rta, struct nlmsghdr *n, struct netlink_skb_parms *req)

int (*tb_delete)(struct fib_table *table, struct rtmsg *r, struct kern_rta *rta, struct nlmsghdr *n, struct netlink_skb_parms *req);

tb_insert is called by inet_rtm_newroute and ip_rt_ioctl to process the ip route add/change/replace/prepend/append/test and route add user-space commands. Similarly, tb_delete is called by inet_rtm_delroute (in answer to ip route del ... commands) and by ip_rt_ioctl (in answer to route del ... commands) to delete a route from a table. Both are also called by fib_magic (see the section "Routes Inserted by the Kernel: The fib_magic Function").

int (*tb_dump)(struct fib_table *table, struct sk_buff *skb, struct netlink_callback *cb)

Dumps the content of a routing table. It is invoked to handle user commands such as "ip route get ...".

int (*tb_flush)(struct fib_table *table)

Removes the fib_info structures that have the RTNH_F_DEAD flag set. See the section "Garbage Collection" in Chapter 33.

void (*tb_select_default)(struct fib_table *table, const struct flowi *flp, struct fib_result *res)

Selects a default route. See the section "Default Gateway Selection" in Chapter 35.

unsigned char tb_data[0]

Pointer to the end of the structure. It is useful when the structure is allocated as part of a bigger one, because it allows code to point to the part of the outer data structure that immediately follows this one. See Figure 34-1 in Chapter 34.

fn_zone Structure

A zone is the collection of routes that have the same netmask length. The routes of a routing table are organized into zones, as described in Chapter 32. Zones are defined with fn_zone structures, which contain the following fields:

struct fn_zone *fz_next

Pointer used to link together the active zones (i.e., the ones with at least one route). The head of the list is kept in fn_zone_list, which is a field of the fn_hash data structure.

struct hlist_head *fz_hash

Pointer to the hash table that stores the routes that fall into this zone.

int fz_nent

Number of routes in the zone (i.e., number of fib_node instances that are in the zone's hash table). Its value is used, for instance, to detect the need to resize the hash table (see the section "Dynamic resizing of per-netmask hash tables" in Chapter 34).

int fz_divisor

Size (number of buckets) of the hash table fz_hash. See the section "Dynamic resizing of per-netmask hash tables" in Chapter 34.

u32 fz_hashmask

This is simply fz_divisor - 1, and is provided so that cheap binary AND operations can be used instead of expensive modulo operations to compute a value modulo fz_divisor. n % fz_divisor is the same as n & fz_hashmask (for instance, 100 % 16 = 100 & 15), and the latter takes less CPU time.

int fz_order

The number of bits (all consecutive) that are set in the netmask fz_mask, also seen in the code as prefixlen. For instance, given the netmask 255.255.255.0, fz_order would be 24.

u32 fz_mask

The netmask built using fz_order. For example, an fz_order of 3 produces a binary fz_mask of 11100000.00000000.00000000.00000000, or decimal 224.0.0.0.

Along with the structure are two macros used to access the fz_hashmask and fz_mask fields:

#define FZ_HASHMASK(fz) ((fz)->fz_hashmask)
#define FZ_MASK(fz) ((fz)->fz_mask)

fib_node Structure

There is a fib_node instance for each unique destination network for which the kernel has a route. Different routes that lead to the same destination network but that differ with regard to other configuration parameters share the same fib_node instance. Here is the field-by-field description:

struct hlist_node fn_hash

fib_node elements are organized into hash tables. This pointer is used to link the elements that collide in a single bucket of a hash table.

struct list_head fn_alias

Each fib_node structure is associated with a list of one or more fib_alias structures. This is the pointer to the head of that list.

u32 fn_key

This is the prefix of the route (the network address, indicated by the route's netmask). It is used as a search key. See the section "Basic Structures for Hash Table Organization" in Chapter 34.

fib_alias Structure

fib_alias instances are used to distinguish between different routes to the same destination network that differ with regard to other configuration parameters (besides the destination address). Here is the field-by-field description:

struct list_head fa_list

Used to link the fib_alias instances associated with the same fib_node structure.

struct fib_info *fa_info

Pointer to the fib_info instance that stores the information about how to process packets matching this route.

u8 fa_tos

Route's Type of Service (TOS) bitfield. When the value is zero, it means the TOS has not been configured and therefore any value can match on a routing lookup. Do not confuse fa_tos with the r_tos field of fib_rule. The fa_tos field allows the user to specify conditions on the TOS for individual routing entries. In contrast, the r_tos field of fib_rule specifies conditions on the TOS for policy rules.

u8 fa_type

See the description of the rt_type field in the section "rtable Structure."

u8 fa_scope

Scope of the route. See the section "Scope" in Chapter 30.

u8 fa_state

Bitmap of flags. The only flag defined so far is the following:

FA_S_ACCESSED

Whenever the fib_alias instance is accessed with a lookup, it is marked with this flag. The flag is useful when a change is applied to a fib_node data structure: it is used to decide whether the routing cache should be flushed. If fib_node has been accessed, it probably means entries in the routing cache need to be cleared if the route changes; thus, a flush is triggered.

fib_info Structure

The parameters that define a route are contained in the combination of fib_node and fib_alias structures, described in the previous sections. Important routing information such as the next hop gateway is stored in a fib_info structure. Here is the field-by-field description:

struct hlist_node fib_hash

struct hlist_node fib_lhash

Used to insert the data structure into the two hash tables described in the section "Organization of fib_info Structures" in Chapter 34.

int fib_treeref

atomic_t fib_clntref

Reference counts. fib_treeref is the number of fib_node data structures holding a reference on this fib_info instance, and fib_clntref is the number of references being held as a result of successful routing lookups.

int fib_dead

A flag that tags routes being removed. When set to 1, it warns that the structure is not to be used because it is about to be removed. See the section "Deleting a Route" in Chapter 34.

unsigned fib_flags

Set of RTNH_F_XXX flags, listed in Table 36-7. The only flag currently used is RTNH_F_DEAD, which is set for a multipath route when all the associated fib_nh structures have their RTNH_F_DEAD flags set (see the section "Generic Helper Routines and Macros" in Chapter 32).

Table 36-7. Values for the nh_flags field of fib_nh

Flag

Description

RTNH_F_DEAD

This flag is used mainly by the multipath code to keep track of dead next hops. See the description of fib_sync_down in the section "Generic Helper Routines and Macros" in Chapter 32.

RTNH_F_PERVASIVE

This flag is supposed to mark entries that require recursive lookups but is currently not used. The latest IPROUTE2 releases do not accept the pervasive keyword anymore.

RTNH_F_ONLINK

When this flag is set, the kernel is asked not to check for consistency on the next hop address (i.e., not to check whether the next hop address is reachable on the outgoing device). It is set with the onlink keyword and is used, for instance, when defining routes on tunnel virtual devices.

int fib_protocol

Protocol that installed the route. The possible values for this field, RTPROT_XXX, are defined in include/linux/rtnetlink.h and are listed in Tables 36-8 and 36-9 (ROUTED is missing from these tables because it does not use Netlink to interface with the kernel). See the section "Routing Protocol Daemons" in Chapter 31 for a brief overview of these protocols.

Values of fib_protocol greater than RTPROT_STATIC are used only by routes not generated by the kernel (i.e., those generated by user-space routing daemons).

One example of a use for this field is to allow routing daemons to restrict operations to their own routes when dealing with the kernel. See the section "Interaction between daemons and kernel" in Chapter 31 for more details.

Table 36-8. Values of fib_protocol used by the kernel

Value

Description

RTPROT_UNSPEC

Field is invalid.

RTPROT_REDIRECT

Route installed by ICMP redirects; not used by current IPv4.

RTPROT_KERNEL

Route installed by kernel. See the section "Routes Inserted by the Kernel: The fib_magic Function."

RTPROT_BOOT

Route installed by user-space commands such as ip route and route.

RTPROT_STATIC

Route installed by administrator. Not used.

Table 36-9. Values of fib_protocol used by user space

Value

Description

RTPROT_GATED

The route was added by GateD.

RTPROT_RA

The route was added by RDISC (IPv4) and ND (IPv6) router advertisements. There is a mechanism, the ICMP Router Discovery Protocol defined in RFC 1256, that lets hosts find neighboring routers. rdisc, which is part of the iputils package, is the user-space tool that implements ICMP Router Discovery Messages.

RTPROT_MRT

The route was added by the Multi-Threaded Routing Toolkit (MRT).

RTPROT_ZEBRA

The route was added by Zebra.

RTPROT_BIRD

The route was added by BIRD.

RTPROT_DNROUTED

The route was added by the DECnet routing daemon.

RTPROT_XORP

The route was added by the XORP routing daemon.

u32 fib_prefsrc

Preferred source IP address. See the section "Selecting the Source IP Address" in Chapter 35.

u32 fib_priority

Priority of the route. The smaller the value, the higher the priority. Its value can be configured with IPROUTE2 using the metric/priority/preference keywords. When not explicitly set, it defaults to 0, the value the kernel initializes it to.

u32 fib_metrics[RTAX_MAX]

When you configure a route, the ip route command allows you to also specify a set of metrics. fib_metrics is a vector used to store them. Metrics not explicitly configured are initialized to zero. See the section "Essential Elements of Routing" in Chapter 30 for a list of the available metrics. Table 36-10 shows the relationships between the metrics listed in that section and the associated kernel symbols RTAX_XXX defined in include/linux/rtnetlink.h.

Table 36-10. Routing metrics

Metric                         Kernel symbol
Not a metric                   RTAX_LOCK
Path MTU                       RTAX_MTU
Maximum Advertised Window      RTAX_WINDOW
Round Trip Time                RTAX_RTT
RTT Variance                   RTAX_RTTVAR
Slow Start threshold           RTAX_SSTHRESH
Congestion Window              RTAX_CWND
Maximum Segment Size           RTAX_ADVMSS
Maximal Reordering             RTAX_REORDERING
Default Time To Live (TTL)     RTAX_HOPLIMIT
Initial Congestion Window      RTAX_INITCWND
Not a metric                   RTAX_FEATURES

int fib_power

This field is part of the data structure only when the kernel is compiled with support for multipath. See the section "Concepts Behind Multipath Routing" in Chapter 31.

struct fib_nh fib_nh[0]

int fib_nhs

fib_nh is a variable-length vector of fib_nh structures, and fib_nhs is its size. fib_nhs can be greater than 1 only when the kernel supports the Multipath feature. See the section "Concepts Behind Multipath Routing" in Chapter 31, and see Figure 34-1 in Chapter 34.

u32 fib_mp_alg

Multipath caching algorithm. The IP_MP_ALG_XXX IDs of the algorithms introduced in the section "Cache Support for Multipath" in Chapter 31 are listed in include/linux/ip_mp_alg.h. This field is part of the data structure only when the kernel is compiled with support for multipath caching.

#define fib_dev fib_nh[0].nh_dev

Macro used to access the nh_dev field of the first fib_nh instance of the fib_nh vector. See Figure 34-1 in Chapter 34.

#define fib_mtu fib_metrics[RTAX_MTU-1]

#define fib_window fib_metrics[RTAX_WINDOW-1]

#define fib_rtt fib_metrics[RTAX_RTT-1]

#define fib_advmss fib_metrics[RTAX_ADVMSS-1]

Macros used to access specific elements of the fib_metrics vector.

fib_nh Structure

For each next hop, the kernel needs to keep more than just the IP address. The fib_nh structure stores that extra information in the following fields.

struct net_device *nh_dev

This is the net_device data structure associated with the device ID nh_oif (described later). Since both the ID and the pointer to the net_device structure are needed (in different contexts), both of them are kept in the fib_nh structure, even though either one could be used to retrieve the other.

struct hlist_node nh_hash

Used to insert the structure into the hash table described in the section "Organization of Next-Hop Router Structures" in Chapter 34.

struct fib_info *nh_parent

Pointer to the fib_info structure that contains this fib_nh instance. See Figure 34-1 in Chapter 34.

unsigned nh_flags

A set of RTNH_F_XXX flags defined in include/linux/rtnetlink.h and listed in Table 36-7 earlier in this chapter.

unsigned char nh_scope

Scope of the route used to get to the next hop. It is RT_SCOPE_LINK in most cases. This field is initialized by fib_check_nh.

int nh_weight

int nh_power

These two fields are part of the fib_nh data structure only when the kernel is compiled with support for multipath, and are described in detail in the section "Concepts Behind Multipath Routing" in Chapter 31. nh_power is initialized by the kernel; nh_weight is set by the user with the keyword weight.

__u32 nh_tclassid

This field is part of the fib_nh data structure only when the kernel is compiled with support for the routing table based classifier. Its value is set with the realms keyword. See the section "Policy Routing and Routing Table Based Classifier" in Chapter 35.

int nh_oif

ID of the egress device. It is set with the keywords oif and dev.

u32 nh_gw

IP address of the next hop gateway provided with the keyword via. Note that in the case of NAT, this represents the address that the NAT router advertises to the world, and to which replies are sent before the router sends them on to the host on the internal network. For example, the command ip route add nat 10.1.1.253/32 via 151.41.196.1 would set nh_gw to 151.41.196.1. Note that NAT support in the routing code, known as FastNAT, has been dropped in 2.6 kernels.

fib_rule Structure

Policy routing rules (also called policies) are configured with the ip rule command. If the IPROUTE2 package is installed on your Linux system, you can type ip rule help to see the syntax of the command. Policies are stored in fib_rule structures, whose fields are described here:

struct fib_rule *r_next

Links these structures within a global list that contains all fib_rule structures (see Figure 35-8 in Chapter 35).

atomic_t r_clntref

Reference count. It is incremented by fib_lookup (in the Policy Routing version only), which explains why fib_res_put (which decrements it) is always called after a successful lookup.

u32 r_preference

Priority of the rule. This can be configured using the keywords priority, preference and order when the administrator adds a policy with IPROUTE2. When not explicitly configured, the kernel assigns a priority that is one unit smaller than the priority of the last user-added rule (see inet_rtm_newrule). Priorities 0, 0x7FFE, and 0x7FFF are reserved for special rules installed by the kernel (see the section "fib_lookup with Policy Routing" in Chapter 35, and the definitions of the three default rules local_rule, main_rule, and default_rule in net/ipv4/fib_rules.c).

unsigned char r_table

Routing table identifier. Ranges from 0 to 255. When it is not specified by the user, IPROUTE2 uses the following defaults: RT_TABLE_MAIN when the user command adds a rule, and RT_TABLE_UNSPEC in other cases (e.g., when deleting a rule).

unsigned char r_action

The values allowed for this field are the rtm_type enum listed in include/linux/rtnetlink.h (RTN_UNICAST, etc.). The meanings of these values are described in the section "rtable Structure."

This field can be explicitly set by the user using the type keyword when configuring a rule. When it is not explicitly configured by the user, IPROUTE2 sets it to RTN_UNICAST when adding rules, and RTN_UNSPEC otherwise (e.g., when deleting rules).

unsigned char r_dst_len

unsigned char r_src_len

Length of the destination and source IP addresses, expressed in bits. They are used to compute r_srcmask and r_dstmask. When not initialized, they are set to zero.

u32 r_src

u32 r_srcmask

IP address and netmask, respectively, of the source network from which packets must come.

u32 r_dst

u32 r_dstmask

IP address and netmask, respectively, of the destination network to which packets must be directed.

u32 r_srcmap

Field that used to be set with the user-space keywords nat and map-to and was used by the Routing NAT implementation. Routing NAT support has been removed, so this field is not used anymore. See the section "Recently Dropped Options" in Chapter 32.

u8 r_flags

Set of flags. Currently not used.

u8 r_tos

IP header's TOS field. Included because the definition of a rule can include a condition placed on the IP header TOS field.

u32 r_fwmark

When the kernel is compiled with support for the "Use Netfilter MARK value as routing key" feature, it is possible to define rules in terms of firewall tags. This is the tag specified by the fwmark keyword when the administrator defines a policy rule.

int r_ifindex

char r_ifname[IFNAMSIZ]

r_ifname is the name of the device the policy applies to. Given r_ifname, the kernel finds the associated net_device instance and copies the value of its ifindex field into r_ifindex. The value -1 for r_ifindex is used to disable the rule (see the section "Impacts on the policy database" in Chapter 32).

__u32 r_tclassid

This field is included in the data structure only when the kernel is compiled with support for the routing table based classifier. Its meaning is described in the section "Policy Routing and Routing Table Based Classifier" in Chapter 35.

int r_dead

When a rule is available for use, this field is 0. When the rule is removed with inet_rtm_delrule, the field is set to 1. Every time a reference to the fib_rule data structure is released with fib_rule_put, the reference count is decremented; when it reaches zero, the structure is supposed to be freed. If r_dead is not set at that point, something went wrong (for instance, code has set the reference count incorrectly).

fib_result Structure

The fib_result structure is initialized by fib_semantic_match to the result of a routing lookup. See Chapters 33 and 35 (in particular, the section "Semantic Matching on Subsidiary Criteria") for more details. The fields in the structure are:

unsigned char prefixlen

Prefix length of the matching route. See the description of fz_order in the section "fn_zone Structure."

unsigned char nh_sel

Multipath routes are defined with multiple next hops. This field identifies the next hop that has been selected.

unsigned char type

unsigned char scope

These two fields are initialized to the values of the fa_type and fa_scope fields of the matching fib_alias instance.

__u32 network

__u32 netmask

These two fields are included in the data structure definition only when the kernel is compiled with support for multipath caching. See the section "Weighted Random Algorithm" in Chapter 33 for how they are used by the weighted random multipath caching algorithm.

struct fib_info *fi

The fib_info instance associated with the matching fib_alias instance.

struct fib_rule *r

Unlike the previous fields, this one is initialized by fib_lookup. This field is included in the data structure definition only when the kernel is compiled with support for Policy Routing.

rtable Structure

IPv4 uses rtable data structures to store routing table entries in the cache.[*] To dump the contents of the routing cache, you can view /proc/net/rt_cache (see the section "Tuning via /proc Filesystem"), or issue the ip route list cache or route -C commands. Here is a field-by-field description of the data structure:

union {...} u

This union is used to embed a dst_entry structure into the rtable structure (see the section "Hash Table Organization" in Chapter 33). One of its fields, rt_next, is used to link the rtable instances that collide into the same hash table's bucket.

struct in_device *idev

Pointer to the IP configuration block of the egress device. Note that when the route is used for ingress packets that are to be delivered locally, the egress device is the loopback device.

unsigned rt_flags

The flags you can set in this bitmap are the RTCF_XXX values defined in include/linux/in_route.h and listed in Table 36-11.

Table 36-11. Possible values for rt_flags

Flag

Description

RTCF_NOTIFY

Interested user-space applications are notified of any change to the routing entry via Netlink. This option is not yet completely implemented. The flag is set with commands such as ip route get 10.0.1.0/24 notify.

RTCF_REDIRECTED

The entry has been added in response to a received ICMP_REDIRECT message (see ip_rt_redirect and its caller).

RTCF_DOREDIRECT

This flag is set by ip_route_input_slow when an ICMP_REDIRECT message must be sent back to the source. ip_forward, described in detail in Chapter 20, decides whether to actually send the ICMP redirect based on this flag and other information. For instance, if the packet was source routed, no ICMP redirect would be generated.

RTCF_DIRECTSRC

This flag is used mostly to tell the ICMP code that it should not reply to Address Mask Request Messages. The flag is set every time a call to fib_validate_source says that the source of the received packet is reachable with a next hop that has a local scope (RT_SCOPE_HOST). See Chapters 25 and 35 for more detail.

RTCF_SNAT

RTCF_SNAT

RTCF_DNAT

RTCF_DNAT

RTCF_NAT

RTCF_NAT

IPv4 不再使用这些标志。它们由已从 2.6 内核中删除的 FastNAT 功能使用(请参阅第 32 章中的“最近删除的选项” 部分)。

These flags are not used anymore by IPv4. They were used by the FastNAT feature that has been removed from the 2.6 kernels (see the section "Recently Dropped Options" in Chapter 32).

RTCF_BROADCAST

RTCF_BROADCAST

该路由的目的地址是广播地址。

The destination address of the route is a broadcast address.

RTCF_MULTICAST

RTCF_MULTICAST

该路由的目的地址是组播地址。

The destination address of the route is a multicast address.

RTCF_LOCAL

RTCF_LOCAL

路由的目标地址是本地的(即在本地接口之一上配置)。该标志还为本地广播和多播地址设置(请参阅 ip_route_input_mc)。

The destination address of the route is local (i.e., configured on one of the local interfaces). This flag is also set for local broadcast and multicast addresses (see ip_route_input_mc).

RTCF_REJECT

RTCF_REJECT

未使用。根据 IPROUTE2 的 ip rule 命令的语法,存在 reject 关键字,但不被接受。

Not used. According to the syntax of IPROUTE2's ip rule command, there is a reject keyword, but it is not accepted.

RTCF_TPROXY

RTCF_TPROXY

未使用。

Not used.

RTCF_DIRECTDST

RTCF_DIRECTDST

未使用。

Not used.

RTCF_FAST

RTCF_FAST

未使用。该标志已过时;它曾被设置用来将路由标记为符合快速交换(Fast Switching)的条件,该功能已在 2.6 内核中删除。

Not used. This flag is obsolete; it used to be set to mark a route as eligible for Fast Switching, a feature that has been dropped in the 2.6 kernels.

RTCF_MASQ

RTCF_MASQ

IPv4 不再使用。该标志应该标记来自伪装源地址的数据包。

Not used anymore by IPv4. The flag was supposed to mark packets coming from masqueraded source addresses.

unsigned rt_type
unsigned rt_type

路由类型。它间接定义当路由在路由查找中匹配时要采取的操作。该字段的可能值是在 include/linux/rtnetlink.h 中定义的 RTN_XXX 宏,并在表 36-12 中列出。

Type of route. It indirectly defines the action to take when the route matches on a routing lookup. The possible values for this field are the RTN_ XXX macros defined in include/linux/rtnetlink.h and listed in Table 36-12.

表 36-12。rt_type 的可能值

Table 36-12. Possible values for rt_type

路由类型

Route type

描述

Description

RTN_UNSPEC

RTN_UNSPEC

定义一个未初始化的值。例如,当从路由表中删除条目时,会使用该值,因为该操作不需要指定条目的类型。

Defines a noninitialized value. This value is used, for instance, when removing an entry from the routing table, because that operation does not require the type of entry to be specified.

RTN_LOCAL

RTN_LOCAL

目的地址是在本地接口上配置的。

The destination address is configured on a local interface.

RTN_UNICAST

RTN_UNICAST

该路由是到单播地址的直接或间接(通过网关)路由。当用户没有指定其他类型时,这是ip Route命令设置的默认值。

The route is a direct or indirect (via a gateway) route to a unicast address. This is the default value set by the ip route command when no other type is specified by the user.

RTN_MULTICAST

RTN_MULTICAST

目的地址是多播地址。

The destination address is a multicast address.

RTN_BROADCAST

RTN_BROADCAST

目的地址是广播地址。匹配的入口数据包作为广播在本地传送,匹配的出口数据包作为广播发送。

The destination address is a broadcast address. Matching ingress packets are delivered locally as broadcasts, and matching egress packets are sent as broadcasts.

RTN_ANYCAST

RTN_ANYCAST

匹配的入口数据包作为广播在本地传送,匹配的出口数据包作为单播发送。IPv4 不使用。

Matching ingress packets are delivered locally as broadcasts, and matching egress packets are sent as unicast. Not used by IPv4.

RTN_BLACKHOLE

RTN_BLACKHOLE

RTN_UNREACHABLE

RTN_UNREACHABLE

RTN_PROHIBIT

RTN_PROHIBIT

RTN_THROW

RTN_THROW

这些值与特定的管理配置相关,而不是与目标地址类型相关。请参阅第 30 章中的“路由类型和操作” 部分。

These values are associated with specific administrative configurations rather than destination address types. See the section "Route Types and Actions" in Chapter 30.

RTN_NAT

RTN_NAT

必须转换源和/或目标 IP 地址。未使用,因为相关功能 FastNAT 已在 2.6 内核中删除。

The source and/or destination IP address must be translated. Not used because the associated feature, FastNAT, has been dropped in the 2.6 kernels.

RTN_XRESOLVE

RTN_XRESOLVE

外部解析器将处理该路由。该功能目前尚未实现。

An external resolver will take care of this route. This functionality is currently not implemented.

_ _u16 rt_multipath_alg
_ _u16 rt_multipath_alg

多路径缓存算法。它根据关联路由上配置的算法进行初始化(请参阅 "fib_info 结构" 部分中的 fib_mp_alg)。

Multipath caching algorithm. It is initialized based on the algorithm configured on the associated route (see fib_mp_alg in the section "fib_info Structure").

_ _u32 rt_dst
_ _u32 rt_dst

_ _u32 rt_src
_ _u32 rt_src

目标和源 IP 地址。

Destination and source IP addresses.

int rt_iif
int rt_iif

入口设备 ID。它的值是从入口设备的 net_device 数据结构中提取的。对于本地生成的流量(因此不在任何接口上接收),该字段设置为传出设备的 ifindex 字段。不要将此字段与本章稍后描述的 flowi 数据结构 fl 的 iif 字段混淆。对于本地生成的流量,后一个字段设置为零 (loopback_dev)。

ID of the ingress device. Its value is extracted from the net_device data structure of the ingress device. For traffic generated locally (and hence not received on any interface), the field is set to the ifindex field of the outgoing device. Do not confuse this field with the iif field of the flowi data structure fl described later in this chapter. The latter field is set to zero (loopback_dev) for locally generated traffic.

_ _u32 rt_gateway
_ _u32 rt_gateway

当目标主机直接连接时(on-link),rt_gateway匹配目标地址。当需要网关到达目的地时,rt_gateway设置为路由标识的下一跳网关。

When the destination host is directly connected (it is on-link), rt_gateway matches the destination address. When a gateway is needed to reach the destination, rt_gateway is set to the next hop gateway identified by the route.

struct flowi fl
struct flowi fl

用于缓存查找的搜索键,如“ flowi 结构”部分中所述。

Search key used for the cache lookups, described in the section "flowi Structure."

_ _u32 rt_spec_dst
_ _u32 rt_spec_dst

RFC 1122 特定的目的地,在第 35 章的“首选源地址选择”部分中进行了解释。

RFC 1122-specific destination, explained in the section "Preferred Source Address Selection" in Chapter 35.

struct inet_peer *peer
struct inet_peer *peer

第 19 章中介绍的 inet_peer 结构存储有关 IP 对等体的长期信息,该对等体是具有此缓存路由的目标 IP 地址的主机。本地主机最近与之通信的每个远程 IP 地址都有一个 inet_peer 结构。

The inet_peer structure, introduced in Chapter 19, stores long-living information about the IP peer, which is the host with the destination IP address of this cached route. There is an inet_peer structure for each remote IP address to which the local host has been talking in the recent past.

dst_entry结构

dst_entry Structure

该数据结构 dst_entry用于存储有关缓存路由的独立于协议的信息。L3 协议将自己的附加私有信息保存在单独的结构中。(例如,IPv4 使用rtable结构。)

The data structure dst_entry is used to store the protocol-independent information concerning cached routes. L3 protocols keep their own, additional private information in separate structures. (For example, IPv4 uses rtable structures.)

以下是逐个字段的描述:

Here is the field-by-field description:

struct dst_entry *next
struct dst_entry *next

用于链接散列到同一个哈希表桶中而发生冲突的 dst_entry 实例。请参见第 33 章中的图 33-1。

Used to link the dst_entry instances that collide into the same hash table's bucket. See Figure 33-1 in Chapter 33.

struct dst_entry *child
struct dst_entry *child

unsigned short header_len
unsigned short header_len

unsigned short trailer_len
unsigned short trailer_len

struct dst_entry *path
struct dst_entry *path

struct xfrm_state *xfrm
struct xfrm_state *xfrm

这些字段由 IPsec 代码使用。

These fields are used by IPsec code.

atomic_t _ _refcnt
atomic_t _ _refcnt

参考计数。请参阅第 33 章中的“删除 DST 条目”部分。

Reference count. See the section "Deleting DST Entries" in Chapter 33.

int _ _use
int _ _use

该条目已被使用的次数(即缓存查找返回该条目的次数)。不要将此值与 rt_cache_stat[smp_processor_id( )].in_hit 混淆:后者(在"统计信息"部分中描述)表示设备的全局缓存命中数。

Number of times this entry has been used (i.e., number of times that a cache lookup has returned it). Do not confuse this value with rt_cache_stat[smp_processor_id( )].in_hit: the latter (described in the section "Statistics") represents the global number of cache hits for the device.

struct net_device *dev
struct net_device *dev

出口设备(即,从哪里传输以到达目的地)。

Egress device (i.e., where to transmit to reach the destination).

short obsolete
short obsolete

用于定义该 dst_entry 实例的可用性状态:0(默认值)表示该结构有效并且可以使用,2 表示该结构正在被删除因此无法使用,-1 由 IPsec 和 IPv6 使用,但 IPv4 不使用。

Used to define the usability status of this dst_entry instance: 0 (the default value) means the structure is valid and can be used, 2 means the structure is being removed and therefore cannot be used, and -1 is used by IPsec and IPv6 but not by IPv4.

int flags
int flags

一组标志。DST_HOST由 TCP 使用,表示该路由通向主机(即,它不是通向网络或广播/多播地址的路由)。DST_NOXFRMDST_NOPOLICY、 和DST_NOHASH仅由 IPsec 使用。

Set of flags. DST_HOST is used by TCP and means the route leads to a host (i.e., it is not a route to a network or a broadcast/multicast address). DST_NOXFRM, DST_NOPOLICY, and DST_NOHASH are used only by IPsec.

unsigned long lastuse
unsigned long lastuse

时间戳用于记住上次使用此条目的时间。当成功进行缓存查找时,它会被更新,并且垃圾收集例程使用它来选择要释放的最佳结构。

Timestamp used to remember the last time this entry was used. It is updated when there is a successful cache lookup and it is used by the garbage collection routines to select the best structures to free.

unsigned long expires
unsigned long expires

指示条目何时到期的时间戳。请参阅第 33 章中的"过期标准"部分。

Timestamp that indicates when the entry will expire. See the section "Expiration Criteria" in Chapter 33.

u32 metrics[RTAX_MAX]
u32 metrics[RTAX_MAX]

度量向量,主要由 TCP 使用。该向量使用 fib_info->fib_metrics 向量的副本(如果已定义)进行初始化,并在需要时使用默认值。参见函数 rt_set_nexthop 和第 35 章。有关向量可能值的说明,请参阅表 36-10。

RTAX_LOCK 值需要一些解释。RTAX_LOCK 不是一个度量而是一个位图:当位置 n 的位被设置时,表示枚举值为 n 的度量已使用 lock 选项/关键字进行配置。换句话说,像 ip route add ... advmss lock ... 这样的命令会设置 1<<RTAX_ADVMSS 位。当度量被锁定时,它不能被协议事件更改。

Vector of metrics, used mostly by TCP. This vector is initialized with a copy of the fib_info->fib_metrics vector (if it is defined), and default values are used where needed. See the function rt_set_nexthop and Chapter 35. See Table 36-10 for a description of the vector's possible values.

The RTAX_LOCK value needs a little explanation. RTAX_LOCK is not a metric but a bitmap: when the bit in position n is set, it means that the metric with enum value n has been configured with the lock options/keyword. In other words, a command like ip route add ... advmss lock ... sets the 1<<RTAX_ADVMSS bit. When a metric is locked, it cannot be changed by protocol events.

unsigned long rate_last
unsigned long rate_last

unsigned long rate_tokens
unsigned long rate_tokens

这两个字段用于对两种类型的ICMP消息进行速率限制。请参阅第 33 章中的“出口 ICMP 重定向速率限制”部分和第 35 章中的“路由失败”部分。

These two fields are used to rate limit two types of ICMP messages. See the section "Egress ICMP REDIRECT Rate Limiting" in Chapter 33 and the section "Routing Failure" in Chapter 35.

short error
short error

当 fib_lookup API(仅由 IPv4 使用)失败时,错误将以正号保存到 error 中,稍后由 ip_error 用来决定如何处理失败(即决定生成哪个 ICMP)。

When the fib_lookup API (used only by IPv4) fails, the error is saved into error (with a positive sign) and used later by ip_error to decide how to handle the failure (i.e., to decide which ICMP to generate).

struct neighbour *neighbour
struct neighbour *neighbour

struct hh_cache *hh
struct hh_cache *hh

neighbour是包含下一跳的 L3 到 L2 地址映射的数据结构。hh是缓存的 L2 标头。详细信息请参见第六部分的章节。

neighbour is the data structure that contains the L3-to-L2 address mapping for the next hop. hh is the cached L2 header. See the chapters in Part VI for details.

int (*input)(struct sk_buff*)
int (*input)(struct sk_buff*)

int (*output)(struct sk_buff**)
int (*output)(struct sk_buff**)

分别用于处理入口和出口数据包的函数。请参阅第 33 章中的“高速缓存查找” 部分。

Functions used to process ingress and egress packets, respectively. See the section "Cache Lookup" in Chapter 33.

_ _u32 tclassid
_ _u32 tclassid

基于路由表的分类器标签。请参阅第 35 章中的“策略路由和基于路由表的分类器”部分

Routing table based classifier's tag. See the section "Policy Routing and Routing Table Based Classifier" in Chapter 35.

struct dst_ops *ops
struct dst_ops *ops

VFT(虚函数表),其函数用于操作 dst_entry 结构。

VFT whose functions are used to manipulate dst_entry structures.

struct rcu_head rcu_head
struct rcu_head rcu_head

负责互斥。

Takes care of mutual exclusion.

char info[0]
char info[0]

该字段可用作指向数据结构末尾的指针。它只是一个占位符。

This field can be useful as a pointer to the end of the data structure. It is only a placeholder.

dst_ops结构

dst_ops Structure

dst_ops 结构是独立于协议的缓存和使用路由缓存的 L3 协议之间的接口。请参见第 33 章中的"DST 和调用协议之间的接口"部分。以下是逐个字段的描述:

The dst_ops structure is the interface between the protocol-independent cache and L3 protocols that use a routing cache. See the section "Interface Between the DST and Calling Protocols" in Chapter 33. Here is the field-by-field description:

unsigned short family
unsigned short family

地址族。请参阅 include/linux/socket.h 中的 AF_XXX 值。

Address family. See AF_ XXX values in include/linux/socket.h.

unsigned short protocol
unsigned short protocol

协议 ID。请参阅 include/linux/if_ether.h 中的 ETH_P_XXX 值。

Protocol ID. See ETH_P_ XXX values in include/linux/if_ether.h.

unsigned gc_thresh
unsigned gc_thresh

该字段由垃圾收集算法使用,指定路由缓存的大小(桶数)。初始化是在ip_rt_init(IPv4路由子系统初始化函数)中完成的。

This field, used by the garbage collection algorithm, specifies the size (number of buckets) of the routing cache. The initialization is done in ip_rt_init (the IPv4 routing subsystem initialization function).

int (*gc)(void)
int (*gc)(void)

atomic_t entries
atomic_t entries

gc 是垃圾收集函数,当协议已分配的 dst_entry 实例数 (entries) 大于或等于阈值 gc_thresh 时,由 dst_alloc 调用。

gc is the garbage collection function invoked by dst_alloc when the number of dst_entry instances (entries) already allocated by the protocol is greater than or equal to the threshold gc_thresh.

struct dst_entry * (*check)(struct dst_entry *, _ _u32 cookie)
struct dst_entry * (*check)(struct dst_entry *, _ _u32 cookie)

void (*destroy)(struct dst_entry *)
void (*destroy)(struct dst_entry *)

void (*ifdown)(struct dst_entry *, struct net_device *dev, int how)
void (*ifdown)(struct dst_entry *, struct net_device *dev, int how)

struct dst_entry * (*negative_advice)(struct dst_entry *)
struct dst_entry * (*negative_advice)(struct dst_entry *)

void (*link_failure)(struct sk_buff *)
void (*link_failure)(struct sk_buff *)

void (*update_pmtu)(struct dst_entry *dst, u32 mtu)
void (*update_pmtu)(struct dst_entry *dst, u32 mtu)

int (*get_mss)(struct dst_entry *dst, u32 mtu)
int (*get_mss)(struct dst_entry *dst, u32 mtu)

请参见第 33 章中的"DST 和调用协议之间的接口"部分。

See the section "Interface Between the DST and Calling Protocols" in Chapter 33.

int entry_size
int entry_size

外部 L3 路由缓存结构的大小(例如,rtable对于 IPv4)。

Size of the outer L3 routing cache structure (e.g., rtable for IPv4).

kmem_cache_t *kmem_cachep
kmem_cache_t *kmem_cachep

内存池用于分配路由缓存元素。

Memory pool used to allocate routing cache elements.

flowi 结构

flowi Structure

利用该flowi数据结构,可以根据入口和出口设备、L3 和 L4 协议头的参数等字段的组合来定义流量类别。它通常用作查找的搜索键,作为IPsec 策略和其他高级用途的流量选择器。下面简单介绍一下它的字段:

With the flowi data structure, it is possible to define classes of traffic based on the combination of fields such as ingress and egress devices, parameters from the L3 and L4 protocol headers, etc. It is commonly used as a search key for lookups, as a traffic selector for IPsec policies, and other advanced uses. Here is a brief description of its fields:

int oif
int oif

int iif
int iif

出口和入口设备 ID。

Egress and ingress device IDs.

union {...} nl_u
union {...} nl_u

Union 的字段是可用于指定 L3 参数值的结构体。目前支持的协议有 IPv4、IPv6 和 DECnet。

Union whose fields are structures that can be used to specify the values of L3 parameters. The protocols currently supported are IPv4, IPv6, and DECnet.

_ _u8 proto
_ _u8 proto

L4 协议。

L4 protocol.

_ _u8 flags
_ _u8 flags

该变量中定义的唯一标志 FLOWI_FLAG_MULTIPATHOLDROUTE 最初由多路径代码使用,但现已不再使用。

The only flag defined in this variable, FLOWI_FLAG_MULTIPATHOLDROUTE, originally was used by the multipath code, but it is not used anymore.

union {...} uli_u
union {...} uli_u

Union的字段主要是结构体,可以用来指定L4参数的值。当前支持的协议包括 TCP、UDP、ICMP、DECnet 和 IPsec 套件。

Union whose fields are mainly structures that can be used to specify the values of L4 parameters. The protocols currently supported are TCP, UDP, ICMP, DECnet, and the IPsec suite.

由于数据结构不是平面的,而是包含联合和结构,因此内核提供了一组可用于访问其某些字段的宏。

Because the data structure is not flat, but contains unions and structs, the kernel provides a set of macros that can be used to access some of its fields.

rt_cache_stat结构

rt_cache_stat Structure

rt_cache_stat存储用于“统计”部分中介绍的统计的计数器。这是它的计数器:

rt_cache_stat stores the counters used for the statistics introduced in the section "Statistics." Here are its counters:

in_hit
in_hit

out_hit
out_hit

分别表示通过在路由缓存上成功查找而路由的已接收数据包和本地生成数据包的数量(请参阅 ip_route_input 和 ip_route_output_key)。

Number of received and locally generated packets, respectively, that have been routed with a successful lookup on the routing cache (see ip_route_input and ip_route_output_key).

in_slow_tot
in_slow_tot

in_slow_mc
in_slow_mc

in_slow_tot 是由于缓存查找失败而需要在路由表中查找的数据包数量(请参阅 ip_route_input_slow)。仅计算成功的路由表查找。该计数器被称为慢速,因为对路由表的查找可能比对路由缓存的查找慢得多。该计数器包括广播,但不包括多播流量,多播流量在 in_slow_mc 中计数。

in_slow_tot is the number of packets that required a lookup on the routing table because the cache lookup failed (see ip_route_input_slow). Only successful routing table lookups are counted. The counter is called slow because a lookup on the routing tables can be much slower than a lookup on the routing cache. This counter includes broadcasts, but it does not include multicast traffic, which is counted in in_slow_mc.

out_slow_tot
out_slow_tot

out_slow_mc
out_slow_mc

out_slow_tot 和 out_slow_mc 对出口流量发挥与 in_slow_tot 和 in_slow_mc 相同的作用。

out_slow_tot and out_slow_mc play the same role as in_slow_tot and in_slow_mc for the egress traffic.

in_no_route
in_no_route

由于路由表不知道如何到达目标 IP 地址而无法转发的入口数据包数量(只有在没有配置或不使用默认网关的情况下才可能实现)。见ip_route_input_slow。没有计数器来跟踪由于缺少路由而无法发送的本地生成的数据包。

Number of ingress packets that could not be forwarded because the routing table did not know how to reach the destination IP address (which is possible only if no default gateway is configured or usable). See ip_route_input_slow. There is no counter to keep track of the locally generated packets that could not be sent for lack of a route.

in_brd
in_brd

正确接收的广播数据包数量(健全性检查未失败)。没有用于计算已传输广播数量的计数器。

Number of broadcast packets received correctly (no sanity check failed). There is no counter for the number of transmitted broadcasts.

in_martian_dst
in_martian_dst

in_martian_src
in_martian_src

这两个计数器分别表示由于目标或源 IP 地址的健全性检查失败而被丢弃的数据包数量。健全性检查的示例包括:源 IP 地址不能是多播或广播地址,以及目标地址不能属于所谓的零网络,即它不能看起来像 0.n.n.n。

These two counters represent the number of packets that were dropped because the sanity check failed on the destination or source IP addresses, respectively. Examples of sanity checks are that the source IP address cannot be multicast or broadcast and that the destination address cannot belong to the so-called zero-network—that is, it cannot look like 0.n.n.n.

gc_total
gc_total

gc_ignored
gc_ignored

gc_goal_miss
gc_goal_miss

gc_dst_overflow
gc_dst_overflow

这四个字段由 rt_garbage_collect 更新,该函数在第 33 章的"rt_garbage_collect 函数"部分中描述。

gc_total 记录 rt_garbage_collect 被调用的次数。

gc_ignored 是 rt_garbage_collect 因最近刚被调用而立即返回的次数。

gc_goal_miss 是 rt_garbage_collect 扫描缓存但未达到函数开始时设定的目标的次数。

gc_dst_overflow 是 rt_garbage_collect 因未能将缓存条目数减少到 ip_rt_max_size 阈值以下而失败的次数。

These four fields are updated by rt_garbage_collect, described in the section "rt_garbage_collect Function" in Chapter 33.

gc_total keeps track of the number of times rt_garbage_collect is invoked.

gc_ignored is the number of times rt_garbage_collect returns immediately because it was called too recently.

gc_goal_miss is the number of times the cache has been scanned by rt_garbage_collect without meeting the goal set at the beginning of the function.

gc_dst_overflow is the number of times rt_garbage_collect fails by not reducing the number of cache entries below the ip_rt_max_size threshold.

in_hlist_search
in_hlist_search

out_hlist_search
out_hlist_search

它们分别由用于缓存查找的例程 ip_route_input 和 _ _ip_route_output_key 更新。它们表示已测试且不匹配的缓存元素的数量(而不仅仅是缓存未命中的数量)。

These are updated by the routines used for the cache lookups, ip_route_input and _ _ip_route_output_key, respectively. They represent the number of cache elements that have been tested and did not match (not just the number of cache misses).

ip_mp_alg_ops 结构

ip_mp_alg_ops Structure

ip_mp_alg_ops表示路由缓存和多路径缓存功能之间的接口。它由以下函数指针组成:

ip_mp_alg_ops represents the interface between the routing cache and the Multipath caching feature. It consists of the following function pointers:

void (*mp_alg_select_route) (const struct flowi *flp, struct rtable *rth, struct rtable **rp)
void (*mp_alg_select_route) (const struct flowi *flp, struct rtable *rth, struct rtable **rp)

void (*mp_alg_flush) (void)
void (*mp_alg_flush) (void)

void (*mp_alg_set_nhinfo) (_ _u32 network, _ _u32 netmask, unsigned char prefixlen, const struct fib_nh *nh)
void (*mp_alg_set_nhinfo) (_ _u32 network, _ _u32 netmask, unsigned char prefixlen, const struct fib_nh *nh)

void (*mp_alg_remove) (struct rtable *rth)
void (*mp_alg_remove) (struct rtable *rth)

这些函数由第 33 章“路由缓存和多路径之间的接口”部分中描述的与算法无关的包装器调用。

These functions are invoked by the algorithm-independent wrappers described in the section "Interface Between the Routing Cache and Multipath" in Chapter 33.

本书这一部分介绍的函数和变量

Functions and Variables Featured in This Part of the Book

表 36-13 总结了本书涉及路由子系统的章节中介绍或引用的主要函数、变量和数据结构。您可以在第 32 章的"通用帮助例程和宏"与"帮助例程"部分,以及第 35 章的两个"帮助例程"部分中找到更多信息。

Table 36-13 summarizes the main functions, variables, and data structures introduced or referenced in the chapters of this book covering the routing subsystem. You can find more in the sections "Generic Helper Routines and Macros" and "Helper Routines" in Chapter 32, and in the two "Helper Routines" sections in Chapter 35.

表 36-13。路由子系统中的函数、变量和数据结构

Table 36-13. Functions, variables, and data structures in the routing subsystem

函数

Functions

 

for_ifa, endfor_ifa

for_ifa, endfor_ifa

for_primary_ifa, endfor_ifa

for_primary_ifa, endfor_ifa

用于浏览网络设备上配置的 IPv4 地址的宏。请参阅第 32 章中的“主 IP 地址和辅助 IP 地址”部分。

Macros used to browse the IPv4 addresses configured on a network device. See the section "Primary and Secondary IP Addresses" in Chapter 32.

FIB_RES_ XXX

FIB_RES_ XXX

用于访问 fib_result 结构体字段的宏集。请参阅第 32 章中的"通用帮助例程和宏"部分。

Set of macros used to access the fields of the fib_result structure. See the section "Generic Helper Routines and Macros" in Chapter 32.

LOOPBACK

LOOPBACK

ZERONET

ZERONET

MULTICAST

MULTICAST

LOCAL_MCAST/BADCLASS

LOCAL_MCAST/BADCLASS

用于识别特殊IP地址的宏。请参阅第 32 章中的“通用帮助例程和宏”部分。

Macros used to recognize special IP addresses. See the section "Generic Helper Routines and Macros" in Chapter 32.

fib_hash_lock

fib_hash_lock

fib_info_lock

fib_info_lock

fib_rules_lock

fib_rules_lock

rt_flush_lock

rt_flush_lock

fib_multipath_lock

fib_multipath_lock

alg_table_lock

alg_table_lock

锁用于保护各种数据。请参阅第 32 章中的“全局锁”部分。

Locks used to protect various pieces of data. See the section "Global Locks" in Chapter 32.

ip_rt_init

ip_rt_init

ip_fib_init

ip_fib_init

devinet_init

devinet_init

fib_rules_init

fib_rules_init

fib_hash_init

fib_hash_init

dst_init

dst_init

初始化例程。请参见第 32 章中的“路由子系统初始化”部分。

Initialization routines. See the section "Routing Subsystem Initialization" in Chapter 32.

dst_alloc

dst_alloc

为路由缓存分配一个条目。请参阅第 33 章中的“高速缓存条目分配和引用计数”部分。

Allocate an entry for the routing cache. See the section "Cache Entry Allocation and Reference Counts" in Chapter 33.

rt_periodic_timer

rt_periodic_timer

rt_secret_timer

rt_secret_timer

计时器。请参阅第 33 章中的“垃圾收集”和“刷新路由缓存”部分。

Timers. See the sections "Garbage Collection" and "Flushing the Routing Cache" in Chapter 33.

fib_netdev_event

fib_netdev_event

fib_inetaddr_event

fib_inetaddr_event

netdev_chain 和 inetaddr_chain 通知链的处理程序。请参阅第 32 章中的"外部事件"部分。

Handlers for the netdev_chain and inetaddr_chain notification chains. See the section "External Events" in Chapter 32.

fib_add_ifaddr

fib_add_ifaddr

fib_del_ifaddr

fib_del_ifaddr

用于在从本地网络设备的配置中添加或删除 IP 地址时更新路由表。请参阅第 32 章中的“添加 IP 地址”和“删除 IP 地址”部分。

Used to update the routing table upon the addition or removal of an IP address from the configuration of a local network device. See the sections "Adding an IP address" and "Removing an IP address" in Chapter 32.

fib_magic

fib_magic

由内核用于在特定条件下插入路由。请参阅“内核插入的路由:fib_magic 函数”部分。

Used by the kernel to insert routes under specific conditions. See the section "Routes Inserted by the Kernel: The fib_magic Function."

fib_rules_detach

fib_rules_detach

fib_rules_attach

fib_rules_attach

分别在网络设备注册和取消注册时启用和禁用路由策略。请参阅第 32 章中的“对策略数据库的影响”部分。

Enables and disables routing policies when network devices are registered and unregistered, respectively. See the section "Impacts on the policy database" in Chapter 32.

rtmsg_fib

rtmsg_fib

用于在添加或删除路由时在特定 Netlink 多播组上发送通知。请参阅第 32 章中的“ Netlink 通知”部分。

Used to send notification on a specific Netlink multicast group when routes are added or removed. See the section "Netlink Notifications" in Chapter 32.

ip_route_input

ip_route_input

_ _ip_route_output_key ip_route_output_flow

_ _ip_route_output_key ip_route_output_flow

ip_route_output_key

ip_route_output_key

ip_route_connect

ip_route_connect

ip_route_newports

ip_route_newports

前两个函数是路由缓存查找例程,其他函数是它们的包装器。请参阅第 33 章中的“高速缓存查找”部分。

The first two functions are routing cache lookup routines, and the others are wrappers around them. See the section "Cache Lookup" in Chapter 33.

ip_route_input_slow

ip_route_input_slow

ip_route_output_slow

ip_route_output_slow

路由表查找例程。参见第 35 章

Routing table lookup routines. See Chapter 35.

ip_route_input_mc

ip_route_input_mc

用于多播目的地的查找例程。

Lookup routines used for multicast destinations.

ip_mkroute_input

ip_mkroute_input

ip_mkroute_input_def

ip_mkroute_input_def

ip_mkroute_output

ip_mkroute_output

ip_mkroute_output_def

ip_mkroute_output_def

fib_select_default

fib_select_default

fib_select_multipath

fib_select_multipath

ip_route_input_slow 和 ip_route_output_slow 使用的各种支持例程。参见第 35 章。

Various support routines used by ip_route_input_slow and ip_route_output_slow. See Chapter 35.

fib_lookup

fib_lookup

fn_hash_lookup

fn_hash_lookup

fib_semantic_match

fib_semantic_match

在路由表查找期间的不同阶段调用的例程。请参阅第 35 章中的“查找函数的高级视图”部分。

Routines called at different stages during a routing table lookup. See the section "High-Level View of Lookup Functions" in Chapter 35.

fn_hash_insert

fn_hash_insert

将新路由添加到路由表中。请参见第 34 章中的“添加路由”部分。

Add a new route to a routing table. See the section "Adding a Route" in Chapter 34.

fn_hash_delete

fn_hash_delete

从路由表中删除一条路由。请参见第34 章中的“删除路由”部分。

Remove a route from a routing table. See the section "Deleting a Route" in Chapter 34.

rt_intern_hash

rt_intern_hash

将条目添加到路由缓存。请参阅第 33 章中的“向缓存添加元素”部分。

Add an entry to the routing cache. See the section "Adding Elements to the Cache" in Chapter 33.

multipath_alg_register

multipath_alg_register

multipath_alg_unregister

multipath_alg_unregister

注册和取消注册多路径缓存算法。请参阅第 33 章中的“注册缓存算法”部分。

Register and unregister a multipath caching algorithm. See the section "Registering a Caching Algorithm" in Chapter 33.

multipath_select_route

multipath_select_route

multipath_flush

multipath_flush

multipath_set_nhinfo

multipath_set_nhinfo

multipath_remove

multipath_remove

用于管理与多路径路由关联的缓存条目的各种例程。请参阅第 33 章中的“路由缓存和多路径之间的接口”部分。更多例程在同一章的“辅助例程”部分中列出。

Various routines used to manage cache entries associated with multipath routes. See the section "Interface Between the Routing Cache and Multipath" in Chapter 33. More routines are listed in the section "Helper Routines" in the same chapter.

rt_free

rt_free

dst_free

dst_free

分别释放一个 rtable 和一个 dst_entry 结构。

Free an rtable and a dst_entry structure, respectively.

rt_garbage_collect

rt_garbage_collect

rt_may_expire

rt_may_expire

用于路由缓存的垃圾收集例程。请参阅第 33 章中的“ rt_garbage_collect 函数”部分。

Garbage collection routines used for the routing cache. See the section "rt_garbage_collect Function" in Chapter 33.

dst_input

dst_input

dst_output

dst_output

分别完成一个数据包的接收和发送。请参阅第 33 章中的“高速缓存查找” 部分。另请参见第35 章中的“设置接收和发送功能”部分。

Complete the reception and transmission of a packet, respectively. See the section "Cache Lookup" in Chapter 33. See also the section "Setting Functions for Reception and Transmission" in Chapter 35.

rt_garbage_collect

rt_garbage_collect

dst_destroy

dst_destroy

dst_ifdown

dst_ifdown

dst_negative_advice

dst_negative_advice

dst_link_failure

dst_link_failure

dst_set_expires

dst_set_expires

用于初始化与 IPv4 协议关联的 dst_ops 实例的例程。请参见第 33 章中的"DST 和调用协议之间的接口"部分。

Routines used for the initialization of the dst_ops instance associated with the IPv4 protocol. See the section "Interface Between the DST and Calling Protocols" in Chapter 33.

dst_dev_event

dst_dev_event

DST 子系统使用的处理程序来处理来自netdev_chain通知链的通知。请参阅第 32 章中的“外部事件”部分。

Handler used by the DST subsystem to process notifications from the netdev_chain notification chain. See the section "External Events" in Chapter 32.

RT_CACHE_STAT_INC

RT_CACHE_STAT_INC

更新每个 CPU 的统计信息。请参阅“统计”部分。

Update per-CPU statistics. See the section "Statistics."

变量

Variables

 

ip_fib_local_table

ip_fib_local_table

ip_fib_main_table

ip_fib_main_table

路由表。请参阅第 34 章中的“两个默认路由表:ip_fib_main_table 和 ip_fib_local_table ”部分。

Routing tables. See the section "The Two Default Routing Tables: ip_fib_main_table and ip_fib_local_table" in Chapter 34.

rt_hash_table

rt_hash_table

路由缓存。参见第 33 章

Routing cache. See Chapter 33.

rt_hash_mask

rt_hash_mask

路由缓存的大小(即哈希表的桶数)。

Size of the routing cache (i.e., number of buckets of the hash table).

dst_garbage_list

dst_garbage_list

由于仍被引用而无法删除的 dst_entry 实例列表。参见第 33 章。

List of dst_entry instances that cannot be removed because they are still referenced. See Chapter 33.

fib_tables

fib_tables

fib_table 实例列表。请参见第 34 章中的图 34-1。

List of fib_table instances. See Figure 34-1 in Chapter 34.

fib_rules

fib_rules

路由策略列表。请参阅第 35 章中的“带有策略路由的 fib_lookup ”部分。

List of routing policies. See the section "fib_lookup with Policy Routing" in Chapter 35.

fib_info_cnt

fib_info_cnt

未释放的 fib_info 实例数量。请参阅第 34 章中的"动态调整全局哈希表的大小"部分。

Number of outstanding fib_info instances. See the section "Dynamic resizing of global hash tables" in Chapter 34.

fib_info_hash

fib_info_hash

fib_info_laddrhash

fib_info_laddrhash

用于搜索fib_info 实例的哈希表。请参阅第 34 章中的“ fib_info 结构的组织”部分。

Hash tables used to search fib_info instances. See the section "Organization of fib_info Structures" in Chapter 34.

fib_info_devhash

fib_info_devhash

哈希表用于搜索fib_nh 实例。请参见第 34 章中的“下一跳路由器结构的组织”部分。

Hash table used to search fib_nh instances. See the section "Organization of Next-Hop Router Structures" in Chapter 34.

fib_props

fib_props

一个向量,其元素由查找例程 fib_semantic_match 用于将路由类型映射到返回值。请参阅第 35 章中的"fib_semantic_match 的返回值"部分。

Vector whose elements are used by the lookup routine fib_semantic_match to map route types to return values. See the section "Return value from fib_semantic_match" in Chapter 35.

数据结构

Data structures

 

fib_table structure

fib_table structure

fn_zone structure

fn_zone structure

fib_node structure

fib_node structure

fib_alias structure

fib_alias structure

fib_info structure

fib_info structure

fib_nh structure

fib_nh structure

fib_rule structure

fib_rule structure

rtable structure

rtable structure

dst_entry structure

dst_entry structure

dst_ops structure

dst_ops structure

flowi structure

flowi structure

rt_cache_stat structure

rt_cache_stat structure

ip_mp_alg_ops structure

ip_mp_alg_ops structure

路由代码使用的关键数据结构。它们在“本书本部分介绍的数据结构”部分中进行了详细描述。

Key data structures used by the routing code. They are described in detail in the section "Data Structures Featured in This Part of the Book."

本书这一部分介绍的文件和目录

Files and Directories Featured in This Part of the Book

图36-6列出了第七部分各章中提到的文件和目录。

Figure 36-6 lists the files and directories referred to in the chapters in Part VII.


图 36-6。本书这一部分中的文件和目录

Figure 36-6. Files and directories featured in this part of the book




[ * ]请参阅net-tools包中的 lib/inet_gr.c文件。

[*] See the file lib/inet_gr.c in the net-tools package.

[ * ]有关用户权限和进程功能的更多详细信息,请参阅了解 Linux 内核(O'Reilly)。

[*] For more details on user privileges and process capabilities, refer to Understanding the Linux Kernel (O'Reilly).

[ * ]当前并非所有更改都会生成通知。例如,当设备出现故障时,相关 IPv4 路由的删除不会传达给 IPv4 协议。然而,这种行为可能会改变。例如,IPv6 已经被通知。

[*] Not all changes currently generate notifications. For example, when a device goes down, removal of the associated IPv4 routes is not communicated to the IPv4 protocol. This behavior could change, however. For example, IPv6 is already notified.

[ * ]请考虑到内核设置的默认值可能与启动 Linux 系统时获得的默认值不同。原因是每个 Linux 发行版都可以在启动时通过初始化文件和脚本自由更改每个 sysctl 变量的默认值。例如,请参见 /etc/sysctl.conf。此外,不同的内核版本可能使用不同的默认值。

[*] Take into account that the default value set by the kernel may be different from the default value you get when you boot a Linux system. The reason is that each Linux distribution is free to change the default value of each sysctl variable at boot time by means of the initialization files and scripts. See, for instance, /etc/sysctl.conf. Also, different kernel versions could use different default values.

[*] The file actually has read permissions, but if you try reading its contents, the kernel complains.

[*] Most of the parameters that represent periods of time are configured in seconds, but stored in jiffies (the number of seconds * HZ). When you read these values by dumping the contents of the associated files, you may get the value in seconds or in jiffies, depending on what routine the kernel uses to dump them (e.g., the proc_handler routine). For example, proc_dointvec prints the kernel value as is (with the assumption that it is an integer value), whereas proc_dointvec_jiffies converts a value assumed to be expressed in jiffies (i.e., ticks) to seconds for display.

[*] See also the section "Enabling and Disabling Forwarding."

[*] IPv6 uses rt6_info, and DECnet (not covered in this book) uses dn_route.

About the Author

Christian Benvenuti received his master's degree in Computer Science at the University of Bologna in Italy. He collaborated for a few years with the International Center for Theoretical Physics (ICTP) in Trieste, where he developed ad hoc software based on the Linux kernel, was a scientific consultant for a project on remote collaboration, and served as an instructor for several training sessions on networking. The trainings, held mainly in Europe, Africa, and South America, were all based on Linux systems and addressed to scientists from developing countries, where the ICTP has been promoting Linux for many years. He occasionally collaborates with Collaborium.org, a non-profit organization founded by ICTP members, to continue promoting Linux in developing countries.

In the past few years he worked as a software engineer for Cisco Systems in Silicon Valley, where he focused on Layer 2 switching, high availability, and network security.

Colophon

Our look is the result of reader comments, our own experimentation, and feedback from distribution channels. Distinctive covers complement our distinctive approach to technical topics, breathing personality and life into potentially dry subjects.

Philip Dangler was the production editor, and Audrey Doyle was the copyeditor for Understanding Linux Network Internals. Sada Preisch proofread the book. Mary Brady and Colleen Gorman provided quality control. Rachel Monaghan, Lydia Onofrei, and Laurel Ruma provided production assistance. Angela Howard wrote the index.

Karen Montgomery designed the cover of this book, based on a series design by Hanna Dyer and Edie Freedman. The cover image is a 19th-century engraving from Men: A Pictorial Archive from 19th Century Sources. Karen Montgomery produced the cover layout with Adobe InDesign CS using Adobe's ITC Garamond font.

David Futato designed the interior layout. The chapter opening images are from Men: A Pictorial Archive from 19th Century Sources. This book was converted by Keith Fahlgren to FrameMaker 5.5.6 with a format conversion tool created by Erik Ray, Jason McIntosh, Neil Walls, and Mike Sierra that uses Perl and XML technologies. The text font is Linotype Birka; the heading font is Adobe Myriad Condensed; and the code font is LucasFont's TheSans Mono Condensed. The illustrations that appear in the book were produced by Robert Romano, Jessamyn Read, and Lesley Borash using Macromedia FreeHand MX and Adobe Photoshop CS. The tip and warning icons were drawn by Christopher Bing.

Understanding Linux Network Internals

Christian Benvenuti

Editor

Andy Oram

O'Reilly Media

1005 Gravenstein Highway North

Sebastopol, CA 95472

2012-08-20T01:53:50-07:00